Beautiful Soup Archives - Donotsad

Aug 09

Beautiful Soup – find_previous_sibling Method

Method Description

The find_previous_sibling() method in Beautiful Soup returns the closest sibling of this PageElement that matches the given criteria and appears earlier in the document.

Syntax

    find_previous_sibling(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous_sibling() method returns a PageElement, which may be a Tag or a NavigableString. It returns None if no earlier sibling matches.

Example 1

In the HTML string below, we look for the previous sibling of the <i> tag whose tag name is "u".

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><u>Excellent</u><b>Hello</b><i>Python</i></p>", "html.parser")
    tag = soup.i
    sibling = tag.find_previous_sibling("u")
    print(sibling)

Output

    <u>Excellent</u>

Example 2

The web page (index.html) has an HTML form with three input elements. We locate the one whose id attribute is marks and then find its previous sibling whose id is nm.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.find("input", {"id": "marks"})
    sib = tag.find_previous_sibling(id="nm")
    print(sib)

Output

    <input id="nm" name="name" type="text"/>

Example 3

In the code below, the HTML string has two elements and a string inside the outer tag. We use the find_previous_sibling() method to search for the NavigableString sibling of the tag that contains Tutorial.

    from bs4 import BeautifulSoup

    html = """
    <p>Excellent<b>Python</b><p>Tutorial</p></p>
    """

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", string="Tutorial")
    ptag = tag.find_previous_sibling(string="Excellent")
    print(ptag, type(ptag))

Output

    Excellent <class 'bs4.element.NavigableString'>
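The matching behaviour described above can be checked with a small sketch. The markup here is hypothetical and chosen only for illustration. It shows a filtered search, an unfiltered call (which returns the nearest preceding Tag), and the None result when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the method.
markup = "<p><u>Excellent</u><b>Hello</b><i>Python</i></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.i

# Nearest earlier sibling whose tag name is "u".
nearest_u = tag.find_previous_sibling("u")

# With no filter, the nearest earlier sibling that is a Tag is returned.
nearest_any = tag.find_previous_sibling()

# No earlier sibling is a <span>, so the result is None.
missing = tag.find_previous_sibling("span")

print(nearest_u)
print(nearest_any)
print(missing)
```

Checking for None before using the result avoids an AttributeError when the expected sibling is absent.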

Aug 09

Beautiful Soup – previous_element Property

Description

In the Beautiful Soup library, the previous_element property returns the Tag or NavigableString that appears immediately before the current PageElement in the document, even if it lies outside the element's parent. There is also a previous property with similar behaviour.

Syntax

    Element.previous_element

Return value

The previous_element and previous properties return the Tag or NavigableString appearing immediately before the current element.

Example 1

In the document tree parsed from the given HTML string, we find the previous_element of the <p> tag whose id is id1.

    from bs4 import BeautifulSoup

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """

    soup = BeautifulSoup(html, "lxml")
    tag = soup.find("p", id="id1")
    print(tag)
    pre = tag.previous_element
    print("Previous:", pre)
    pre = tag.previous_element.previous_element
    print("Previous:", pre)

Output

    <p id="id1">Tutorial</p>
    Previous: Python
    Previous: <p>Python</p>

The first result may look surprising: the previous element is the string Python rather than the <p>Python</p> tag. That is because the text inside a tag is a separate node that follows the tag itself in document order. To reach the enclosing tag (<p>Python</p>), read the previous_element property of that NavigableString.
Example 2

PageElement objects also support the previous property, which is analogous to previous_element.

    from bs4 import BeautifulSoup

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """

    soup = BeautifulSoup(html, "lxml")
    tag = soup.find("p", id="id1")
    print(tag)
    pre = tag.previous
    print("Previous:", pre)
    pre = tag.previous.previous
    print("Previous:", pre)

Output

    <p id="id1">Tutorial</p>
    Previous: Python
    Previous: <p>Python</p>

Example 3

In the next example, we determine the tag that comes before the <input> tag whose id attribute is "age". The element immediately before it is the line break between the two inputs, so we step back twice.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html5lib")

    tag = soup.find("input", id="age")
    pre = tag.previous_element.previous
    print("Previous:", pre)

Output

    Previous: <input id="nm" name="name" type="text"/>
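The difference between stepping back through the parse order and stepping back among siblings can be seen in a short sketch. The markup is hypothetical, and html.parser is used so the example needs no extra parser package.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, used only to illustrate the property.
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i></p>", "html.parser")
tag = soup.i

# The node parsed just before <i> is the text inside <b>, not <b> itself.
prev_node = tag.previous_element
print(repr(prev_node), isinstance(prev_node, NavigableString))

# One more step back reaches the <b> tag, which is also the previous sibling.
print(prev_node.previous_element)
print(prev_node.previous_element is tag.previous_sibling)
```

In short, previous_element follows document order node by node, while previous_sibling stays at the same level of the tree.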

Aug 09

Beautiful Soup – children Property

Description

The Tag object in the Beautiful Soup library has a children property. It returns an iterator over the immediate child elements and text nodes (i.e. NavigableString objects).

Syntax

    Tag.children

Return value

The property returns an iterator over the direct children of the Tag.

Example 1

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """

    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.div
    children = tag.children
    for child in children:
        print(child)

Output (the whitespace-only strings between the tags are printed as blank lines and are omitted here)

    <p>Java</p>
    <p>Python</p>
    <p>C++</p>

Example 2

The soup object also has the children property.

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """

    soup = BeautifulSoup(markup, "html.parser")
    children = soup.children
    for child in children:
        print(child)

Output

    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>

Example 3

In the following example, we append strings to the first <p> tag and list its children. Beautiful Soup stores each appended string as a NavigableString.

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """

    soup = BeautifulSoup(markup, "html.parser")
    soup.p.extend(["and", "JavaScript"])
    children = soup.p.children
    for child in children:
        print(child)

Output

    Java
    and
    JavaScript
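Because children yields text nodes as well as tags, a common follow-up is to keep only the Tag objects. The sketch below uses hypothetical single-line markup and shows one way to do that.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: three <p> tags separated by single spaces.
markup = '<div id="Languages"><p>Java</p> <p>Python</p> <p>C++</p></div>'
soup = BeautifulSoup(markup, "html.parser")

# Every direct child, including the whitespace strings between tags.
all_children = list(soup.div.children)

# Only the direct children that are tags.
tag_children = [child for child in soup.div.children if isinstance(child, Tag)]
names = [child.get_text() for child in tag_children]

print(len(all_children))
print(names)
```

Filtering with isinstance keeps the loop body free of checks for stray whitespace strings.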

Aug 09

Beautiful Soup – next_sibling Property

Description

Elements that share the same parent are called siblings. The next_sibling property of a PageElement returns the next node under the same parent, which may be a tag or a string.

Syntax

    element.next_sibling

Return type

The next_sibling property returns a PageElement: a Tag or a NavigableString object. It is None when the element is the last child of its parent.

Example 1

The index.html web page consists of an HTML form with three input elements, each with a name attribute. In the following example, we locate the tag that follows the input tag whose id attribute is nm. Because the inputs are separated by line breaks, the immediate next_sibling is a newline string, so the property is read twice to reach the next tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.find("input", {"id": "nm"})
    sib = tag.next_sibling.next_sibling
    print(sib)

Output

    <input id="age" name="age" type="text"/>

Example 2

In the next example, we have an HTML document with a <b> and an <i> tag inside a <p> tag. The next_sibling property returns the tag next to the <b> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", "html.parser")
    tag1 = soup.b
    print("next:", tag1.next_sibling)

Output

    next: <i>Python</i>

Example 3

Consider the HTML string in the following document. It has two <p> tags at the same level. The next_sibling of the first might be expected to give the second <p> tag.

    from bs4 import BeautifulSoup

    html = """
    <p><b>Hello</b><i>Python</i></p>
    <p>TutorialsPoint</p>
    """

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.p
    print("next:", tag1.next_sibling)

Output

    next:

The blank line after the word next: is unexpected, but it appears because of the \n character after the first <p> tag. Change the print statement as shown below to obtain the next tag.

    tag1 = soup.p
    print("next:", tag1.next_sibling.next_sibling)

Output

    next: <p>TutorialsPoint</p>
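When only the next tag is wanted, chaining next_sibling twice depends on exactly one whitespace string sitting in between. A minimal sketch of an alternative, using hypothetical markup, is the find_next_sibling() method, which skips strings when called without a string filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> tags separated by a newline.
html = "<p>Hello</p>\n<p>TutorialsPoint</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.p

# The immediate next sibling is the newline string between the tags.
raw = first.next_sibling

# find_next_sibling() with no arguments returns the next sibling that is a Tag.
tag_only = first.find_next_sibling()

print(repr(raw))
print(tag_only)
```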

Aug 09

Beautiful Soup – previous_elements Property

Description

In the Beautiful Soup library, the previous_elements property returns a generator that yields the strings and tags parsed before the current element, starting with the nearest one and moving back towards the start of the document.

Syntax

    Element.previous_elements

Return value

The previous_elements property returns a generator.

Example 1

The previous_elements property returns the tags and NavigableStrings appearing before the <p id="id1"> tag in the document string below −

    from bs4 import BeautifulSoup

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", id="id1")
    pres = tag.previous_elements
    print("Previous elements:")
    for pre in pres:
        print(pre)

Output (the newline at the start of the string is yielded last and prints as a blank line)

    Previous elements:
    Python
    <p>Python</p>
    Excellent
    <b>Excellent</b>
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>

Example 2

All the elements appearing before the <u> tag are listed below −

    from bs4 import BeautifulSoup

    html = """
    <p>
    <b>Excellent</b><i>Python</i>
    </p>
    <u>Tutorial</u>
    """

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find("u")
    print("previous elements:")
    print(list(tag1.previous_elements))

Output

    previous elements:
    ['\n', '\n', 'Python', <i>Python</i>, 'Excellent', <b>Excellent</b>, '\n', <p>
    <b>Excellent</b><i>Python</i>
    </p>, '\n']

Example 3

The BeautifulSoup object itself doesn't have any previous elements −

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html5lib")

    pres = soup.previous_elements
    print("Previous elements:")
    for pre in pres:
        print(pre.name)

Output

    Previous elements:
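Since the generator mixes tags and strings, it is often filtered by type. The following sketch, with hypothetical markup, collects only the text that precedes a tag, nearest first.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, used only to illustrate the property.
html = "<p><b>Excellent</b><i>Python</i></p><u>Tutorial</u>"
soup = BeautifulSoup(html, "html.parser")
tag = soup.u

# Walk backwards through the parse order and keep only the text nodes.
preceding_text = [
    str(element)
    for element in tag.previous_elements
    if isinstance(element, NavigableString)
]

print(preceding_text)
```

The strings come out in reverse document order, so reversed(preceding_text) restores the reading order if that is needed.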

Aug 09

Beautiful Soup – find Method

Method Description

The find() method in Beautiful Soup looks through the descendants of this PageElement for the first element that matches the given criteria and returns it.

Syntax

    Soup.find(name, attrs, recursive, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If this is True, find() performs a recursive search. Otherwise, only the direct children are considered.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return value

The find() method returns a Tag object or a NavigableString object, or None if nothing matches.

Example 1

Let us use the following HTML script (saved as index.html) for the purpose.

    <html>
       <head>
          <title>TutorialsPoint</title>
       </head>
       <body>
          <form>
             <input type="text" id="nm" name="name">
             <input type="text" id="age" name="age">
             <input type="text" id="marks" name="marks">
          </form>
       </body>
    </html>

The following Python code finds the element whose id is nm.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    obj = soup.find(id="nm")
    print(obj)

Output

    <input id="nm" name="name" type="text"/>

Example 2

The find() method returns the first tag in the parsed document that has the given attributes. Continuing with the same soup object:

    obj = soup.find(attrs={"name": "marks"})
    print(obj)

Output

    <input id="marks" name="marks" type="text"/>

Example 3

If find() can't find anything, it returns None.

    obj = soup.find("dummy")
    print(obj)

Output

    None
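The effect of the recursive parameter is easiest to see side by side. This is a minimal sketch with hypothetical inline markup, so no index.html file is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled loosely on the form used in this section.
html = (
    '<form>'
    '<input type="text" id="nm" name="name">'
    '<input type="text" id="age" name="age">'
    '</form>'
)
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

# <input> is not a direct child of the soup object, so this returns None.
direct = soup.find("input", recursive=False)

# It is a direct child of <form>, so the same call on the form succeeds.
nested = form.find("input", recursive=False)

# An attribute filter passed through attrs.
by_attr = soup.find("input", attrs={"name": "age"})

print(direct)
print(nested["id"])
print(by_attr["id"])
```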

Aug 09

Beautiful Soup – previous_sibling Property

Description

Elements that share the same parent are called siblings. The previous_sibling property of a PageElement returns the node that appears immediately before the current one under the same parent. The related find_previous_sibling() method searches the preceding siblings for one that matches given criteria.

Syntax

    element.previous_sibling

Return type

The previous_sibling property returns a PageElement: a Tag or a NavigableString object. It is None when the element is the first child of its parent.

Example 1

In the following code, the HTML string consists of two adjacent tags inside a <p> tag. It shows the sibling that appears before the <i> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", "html.parser")
    tag = soup.i
    sibling = tag.previous_sibling
    print(sibling)

Output

    <b>Hello</b>

Example 2

We are using the index.html file for parsing. The page contains an HTML form with three input elements. Which element is the previous sibling of the input element whose id attribute is age? The immediate previous sibling is the line break between the inputs, so the property is read twice −

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.find("input", {"id": "age"})
    sib = tag.previous_sibling.previous_sibling
    print(sib)

Output

    <input id="nm" name="name" type="text"/>

Example 3

First we find the tag containing the string "Tutorial" and then find the tag previous to it.

    from bs4 import BeautifulSoup

    html = """
    <p><b>Excellent</b><p>Python</p><p>Tutorial</p></p>
    """

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", string="Tutorial")
    print(tag.previous_sibling)

Output

    <p>Python</p>
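The three possible outcomes of previous_sibling (a string, a tag, or None) can be sketched with a small hypothetical form in which a newline separates two inputs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two inputs separated by a newline.
html = '<form><input id="nm">\n<input id="age"></form>'
soup = BeautifulSoup(html, "html.parser")

nm = soup.find("input", id="nm")
age = soup.find("input", id="age")

# The node just before the second input is the newline string.
whitespace = age.previous_sibling

# Stepping back once more reaches the first input.
before_whitespace = whitespace.previous_sibling

# The first input is the first child of <form>, so it has no previous sibling.
nothing = nm.previous_sibling

print(repr(whitespace))
print(before_whitespace is nm)
print(nothing)
```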

Aug 09

Beautiful Soup – next_siblings Property

Description

Elements that share the same parent are called siblings. The next_siblings property in Beautiful Soup returns a generator used to iterate over all the subsequent tags and strings under the same parent.

Syntax

    element.next_siblings

Return type

The next_siblings property returns a generator of sibling PageElements.

Example 1

The HTML form in index.html contains three input elements. The following script uses the next_siblings property to collect the next siblings of the input element whose id attribute is nm.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.find("input", {"id": "nm"})
    siblings = tag.next_siblings
    print(list(siblings))

Output

    ['\n', <input id="age" name="age" type="text"/>, '\n', <input id="marks" name="marks" type="text"/>, '\n']

Example 2

The following code traverses the next siblings of the <b> tag in a short HTML snippet.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", "html.parser")
    tag1 = soup.b
    print("next siblings:")
    for tag in tag1.next_siblings:
        print(tag)

Output

    next siblings:
    <i>Python</i>
    <u>Tutorial</u>

Example 3

The next example shows that the <head> tag has only one next sibling tag, the <body> tag.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head>
    <title>Hello</title>
    </head>
    <body>
    <p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>
    </body>
    </html>
    """

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.head.next_siblings
    print("next siblings:")
    for tag in tags:
        print(tag)

Output

    next siblings:

    <body>
    <p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>
    </body>

The additional blank lines appear because the generator also yields the line-break strings before and after the <body> tag.
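To work with the following tags only, the whitespace strings can be filtered out of the generator, or find_next_siblings() can be used instead. The sketch below assumes a hypothetical inline copy of the form markup.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical inline markup resembling the form used in this section.
html = """<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
<input type="text" id="marks" name="marks">
</form>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("input", id="nm")

# Keep only the siblings that are tags, dropping the newline strings.
following_ids = [sib["id"] for sib in first.next_siblings if isinstance(sib, Tag)]

# find_next_siblings() applies a tag-name filter directly.
same_result = [sib["id"] for sib in first.find_next_siblings("input")]

print(following_ids)
print(same_result)
```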

Aug 09

Beautiful Soup – descendants Property

Description

With the descendants property of a Tag object in the Beautiful Soup API you can traverse every element nested under it, not only its direct children. The property returns a generator that yields the descendants in document order, which is a depth-first traversal: each tag is followed by its own children and all of their descendants before the next sibling is visited.

Syntax

    tag.descendants

Return value

The descendants property returns a generator object.

Example 1

In the code below, we have an HTML document with nested unordered list tags. We iterate over every descendant of the soup object in depth-first order.

    from bs4 import BeautifulSoup

    html = """
    <ul id="outer">
    <li class="mainmenu">Accounts</li>
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="mainmenu">HR</li>
    <ul>
    <li class="submenu">Anil</li>
    <li class="submenu">Milind</li>
    </ul>
    </ul>
    """

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.descendants
    for desc in tags:
        print(desc)

Output (the whitespace-only strings between tags print as blank lines and are omitted here)

    <ul id="outer">
    <li class="mainmenu">Accounts</li>
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="mainmenu">HR</li>
    <ul>
    <li class="submenu">Anil</li>
    <li class="submenu">Milind</li>
    </ul>
    </ul>
    <li class="mainmenu">Accounts</li>
    Accounts
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="submenu">Anand</li>
    Anand
    <li class="submenu">Mahesh</li>
    Mahesh
    <li class="mainmenu">HR</li>
    HR
    <ul>
    <li class="submenu">Anil</li>
    <li class="submenu">Milind</li>
    </ul>
    <li class="submenu">Anil</li>
    Anil
    <li class="submenu">Milind</li>
    Milind

Example 2

In the following example, we list the descendants of the <head> tag.

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>TutorialsPoint</title></head>
    <body>
    Hello World
    """

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.head
    for element in tag.descendants:
        print(element)

Output

    <title>TutorialsPoint</title>
    TutorialsPoint
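The depth-first order can be made visible with a compact nested list. In this sketch the markup is hypothetical, and each descendant is reduced to either its tag name or its text so the visiting order is easy to read.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: item A has a nested list, item B does not.
html = "<ul><li>A<ul><li>A1</li></ul></li><li>B</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Record tag names for tags and the text itself for strings.
order = [
    element.name if isinstance(element, Tag) else str(element)
    for element in soup.ul.descendants
]

print(order)
```

The nested item A1 is reached before the top-level item B, which is what distinguishes a depth-first walk from a breadth-first one.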

Aug 09

Beautiful Soup – strings Property

Description

For a Tag that has more than one child, the text of each can be fetched with the strings property. Unlike the string property, which is None when a tag has more than one child, strings handles the case where the element contains multiple children. The strings property returns a generator object. It yields the NavigableString objects found under the element, in document order.

Syntax

    Tag.strings

Example 1

You can retrieve the value of the strings property for the soup object as well as for a tag object. In the following example, the soup object's strings property is checked.

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """

    soup = BeautifulSoup(markup, "html.parser")
    print([string for string in soup.strings])

Output

    ['\n', '\n', 'Java', ' ', 'Python', ' ', 'C++', '\n', '\n']

Note the line breaks and white spaces in the list. We can remove them with the stripped_strings property.

Example 2

We now obtain the generator returned by the strings property of the <div> tag. With a loop, we print the strings.

    tag = soup.div
    navstrs = tag.strings
    for navstr in navstrs:
        print(navstr)

Output (the line breaks and spaces between the tags also print, as blank lines)

    Java
    Python
    C++

Note that the line breaks and whitespace appear in the output; they can be removed with the stripped_strings property.
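The stripped_strings property mentioned above can be sketched as follows. The markup is a hypothetical copy of the layout used in this section, with three <p> tags inside a <div>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup resembling the example in this section.
markup = """
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# strings yields every text node, including whitespace-only ones.
raw = list(soup.strings)

# stripped_strings skips whitespace-only nodes and strips the rest.
clean = list(soup.stripped_strings)

print(raw)
print(clean)
```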