Beautiful Soup – find vs find_all

The Beautiful Soup library includes the find() as well as find_all() methods. Both are among the most frequently used methods while parsing HTML or XML documents. From a particular document tree, you often need to locate a PageElement of a certain tag type, having certain attributes, or having a certain CSS style. These criteria are given as arguments to both the find() and find_all() methods. The main point of difference between the two is that find() locates the very first child element that satisfies the criteria, whereas find_all() searches for all the child elements that satisfy the criteria.

The find() method is defined with the following syntax −

Syntax

find(name, attrs, recursive, string, **kwargs)

The name argument specifies a filter on the tag name. With attrs, a filter on tag attribute values can be set up. The recursive argument forces a recursive search if it is True. You can also pass keyword arguments (kwargs) as filters on attribute values.

soup.find(id="nm")
soup.find(attrs={"name": "marks"})

The find_all() method takes the same arguments as the find() method; in addition, there is a limit argument. It is an integer, restricting the search to the specified number of occurrences of the given filter criteria. If not set, find_all() searches for the criteria among all the children under the said PageElement.

soup.find_all("input")
lst = soup.find_all("li", limit=2)

If the limit argument of the find_all() method is set to 1, it virtually acts as the find() method.

The return types of the two methods differ. The find() method returns the first Tag object or NavigableString object found. The find_all() method returns a ResultSet consisting of all the PageElements satisfying the filter criteria.

Here is an example that demonstrates the difference between the find and find_all methods.
Example

from bs4 import BeautifulSoup

markup = open("index.html")
soup = BeautifulSoup(markup, "html.parser")

ret1 = soup.find("input")
ret2 = soup.find_all("input")

print (ret1, "Return type of find:", type(ret1))
print (ret2)
print ("Return type of find_all:", type(ret2))

# set limit=1
ret3 = soup.find_all("input", limit=1)
print ("find:", ret1)
print ("find_all:", ret3)

Output

<input id="nm" name="name" type="text"/> Return type of find: <class 'bs4.element.Tag'>
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
Return type of find_all: <class 'bs4.element.ResultSet'>
find: <input id="nm" name="name" type="text"/>
find_all: [<input id="nm" name="name" type="text"/>]
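The keyword and attrs filters mentioned above can also be sketched with an inline document, so the snippet runs without an external index.html. The form fields below are illustrative assumptions, not taken from the chapter's file:

```python
from bs4 import BeautifulSoup

# Inline markup standing in for the chapter's index.html (hypothetical fields)
markup = """
<form>
   <input id="nm" name="name" type="text"/>
   <input id="age" name="age" type="text"/>
   <input id="marks" name="marks" type="text"/>
</form>
"""
soup = BeautifulSoup(markup, "html.parser")

# keyword argument: filter on the id attribute
print(soup.find(id="nm"))

# attrs dictionary: filter on the name attribute
print(soup.find(attrs={"name": "marks"}))

# limit=1 makes find_all() behave like find()
print(soup.find_all("input", limit=1))
```

Both filter styles return the same kind of Tag object; attrs is handy when the attribute name clashes with a Python keyword (such as class).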

Beautiful Soup – Encoding

All HTML or XML documents are written in some specific encoding like ASCII or UTF-8. However, when you load an HTML/XML document into BeautifulSoup, it is converted to Unicode.

Example

from bs4 import BeautifulSoup

markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print (soup.p)
print (soup.p.string)

Output

<p>I will display £</p>
I will display £

The above behavior is because BeautifulSoup internally uses a sub-library called Unicode, Dammit to detect a document's encoding and then convert it into Unicode. However, Unicode, Dammit doesn't guess correctly all the time. As the document is searched byte-by-byte to guess the encoding, it takes a lot of time. You can save some time and avoid mistakes if you already know the encoding, by passing it to the BeautifulSoup constructor as from_encoding.

Below is one example where BeautifulSoup misidentifies an ISO-8859-8 document as ISO-8859-7 −

Example

from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser")
print (soup.h1)
print (soup.original_encoding)

Output

<h1>νεμω</h1>
iso-8859-7

To resolve the above issue, pass the correct encoding to BeautifulSoup using from_encoding −

Example

from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print (soup.h1)
print (soup.original_encoding)

Output

<h1>םולש</h1>
iso-8859-8

Another feature added in BeautifulSoup 4.4.0 is exclude_encodings. It can be used when you don't know the correct encoding, but are sure that Unicode, Dammit is showing a wrong result.

soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])

Output encoding

The output from BeautifulSoup is a UTF-8 document, irrespective of the encoding of the document fed to BeautifulSoup. Below is a document in which the Polish characters are in ISO-8859-2 format.
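The effect of guessing the wrong encoding can be reproduced with plain Python, without BeautifulSoup: decoding the same four bytes as ISO-8859-7 (Greek) and as ISO-8859-8 (Hebrew) yields entirely different text.

```python
data = b"\xed\xe5\xec\xf9"

# Wrong guess: under ISO-8859-7 these bytes map to Greek letters
print(data.decode("iso-8859-7"))   # νεμω

# Correct encoding: the same bytes are Hebrew under ISO-8859-8
print(data.decode("iso-8859-8"))   # םולש
```

This is exactly the ambiguity Unicode, Dammit has to resolve, which is why supplying from_encoding when you know it is both faster and safer.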
Example

markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
   <HEAD>
      <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
   </HEAD>
   <BODY>
      ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
   </BODY>
</HTML>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-2")
print (soup.prettify())

Output

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
 </head>
 <body>
  ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
 </body>
</html>

In the above example, notice that the <meta> tag has been rewritten to reflect that the document generated by BeautifulSoup is now in UTF-8 format. If you don't want the generated output in UTF-8, you can pass the desired encoding to prettify().

print(soup.prettify("latin-1"))

Output

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n  ą ć ę ł ń \xf3 ś ź ż Ą Ć Ę Ł Ń \xd3 Ś Ź Ż\n </body>\n</html>\n'

In the above example, we encoded the complete document; however, you can encode any particular element in the soup as if it were a Python string −

soup.p.encode("latin-1")
soup.h1.encode("latin-1")

Output

b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'

Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Below is one such example −

markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, "html.parser")
tag = snowman_soup.b
print(tag.encode("utf-8"))

Output

b'<b>\xe2\x98\x83</b>'

If you try to encode the above in "latin-1" or "ascii", the snowman will be rendered as "&#9731;", since those encodings have no representation for it.
print (tag.encode("latin-1"))
print (tag.encode("ascii"))

Output

b'<b>&#9731;</b>'
b'<b>&#9731;</b>'

Unicode, Dammit

Unicode, Dammit is used mainly when the incoming document is in an unknown encoding (often a foreign language) and we want to convert it into a known format (Unicode) without asking BeautifulSoup to parse the whole document.
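BeautifulSoup performs this entity substitution itself, but the same effect can be reproduced with plain Python's xmlcharrefreplace error handler, which makes the rule easy to verify:

```python
snowman = "<b>\N{SNOWMAN}</b>"   # U+2603, decimal 9731

# UTF-8 can represent the snowman directly (three bytes)
print(snowman.encode("utf-8"))                        # b'<b>\xe2\x98\x83</b>'

# ASCII cannot; xmlcharrefreplace falls back to a numeric reference
print(snowman.encode("ascii", "xmlcharrefreplace"))   # b'<b>&#9731;</b>'
```

The numeric reference &#9731; is simply the character's Unicode code point in decimal, so any HTML renderer can restore the original character.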

Behave – Introduction

Behave is a tool used for Behaviour Driven Development (BDD) in the Python programming language. In an Agile development framework, BDD creates a culture where testers, developers, business analysts, and other stakeholders of the project can contribute towards the software development. In short, both technical and non-technical individuals have a role to play in the overall project.

Behave has tests developed in plain text, with the implementation logic in Python. The BDD format begins with a description of the characteristics of the software, similar to a story. It then continues with the development and carries out the following tasks −

Developing a failing test case for the characteristics.
Implementing the logic for the test to pass.
Refactoring the code to fulfil the project guidelines.

There are numerous libraries for BDD, like Mocha which supports JavaScript, Cucumber which supports Java/Ruby, Behave which supports Python, and so on. In this tutorial, we shall discuss Behave in detail.

Let us see the basic structure of a BDD project. It mainly consists of the feature file, the step definition file, and so on.

Feature File

The feature file in Behave can be as follows −

Feature − Verify book name added in Library.
Scenario − Verify Book name.
Given − Book details.
Then − Verify book name.

Corresponding step definition file

Following is the corresponding step definition file in the Behave tool −

from behave import *

@given('Book details')
def impl_bk(context):
   print('Book details entered')

@then('Verify book name')
def impl_bk(context):
   print('Verify book name')

Output

The output obtained after running the feature file shows the Feature and Scenario names, along with the test results and the duration of the respective test execution.
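In actual Gherkin syntax, the feature described above would be saved as a .feature file like the following sketch (the file name and exact wording are illustrative assumptions):

```gherkin
# features/library.feature (hypothetical file name)
Feature: Verify book name added in Library

  Scenario: Verify Book name
    Given Book details
    Then Verify book name
```

Running behave from the project root then matches each Given/Then line against the decorated step functions shown above.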

Behave – Multiline Text

A block of text after a step, enclosed in triple quotes ("""), will be associated with that step. Here, the indentation is parsed: the whitespace at the beginning of each line is stripped, and all subsequent lines must be indented at least as much as the starting line. The text is accessible to the implementing Python code through the .text attribute of the context variable (passed to the step function).

Feature File

The feature file for the feature titled User information is as follows −

Feature: User information
   Scenario: Check login functionality
      Given user enters name and password
         """
         Tutorialspoint Behave
         Topic – Multiline Text
         """
      Then user should be logged in

Corresponding Step Implementation File

The corresponding step implementation file for the feature is as follows −

from behave import *

@given('user enters name and password')
def step_impl(context):
   # access multiline text with the .text attribute
   print("Multiline Text: " + context.text)

@then('user should be logged in')
def step_impl(context):
   pass

Output

The output obtained after running the feature file is mentioned below; the command used is behave --no-capture -f plain. The output shows the multiline text printed.

Behave – Feature Testing Setup

Behave works with three different file types, which are as follows −

Feature files, which are created by a business analyst or any project stakeholder and contain behaviour-related use cases.

Step implementation files for the scenarios defined in the feature file.

Environment setup files, where the pre/post conditions to be executed before and after the steps, features, scenarios, and so on are defined.

Feature File

A feature file should be within a folder called features. Also, there should be a sub-directory steps within the features directory.

Launching a Feature File

We can launch the feature file with various command line arguments. These are explained below −

If no information is provided, all the feature files within the features directory shall be loaded for execution by Behave.

If the path of the features directory is provided, then it is mandatory to have at least one feature file (with the .feature extension) and a sub-directory named steps within the features directory. Also, if environment.py is present, it should be within the directory that contains the steps directory, and not within the steps directory itself.

If the path to a feature file is provided, it instructs Behave to search for it. To get the corresponding steps directory for that feature file, the parent directory is searched. If it is not found in the immediate parent directory, Behave searches its parents, continuing until it reaches the file system root. Also, if environment.py is present, it should be within the directory that contains the steps directory, and not within the steps directory itself.
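A minimal directory layout that satisfies these rules might look as follows (the .feature and step file names are illustrative assumptions):

```text
features/
├── environment.py        # optional hooks; beside steps/, never inside it
├── library.feature
└── steps/
    └── library_steps.py
```

With this layout, running behave with no arguments from the directory containing features/ discovers and executes every feature file automatically.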

Beautiful Soup – Extract Email IDs

Extracting email addresses from a web page is an important application of a web scraping library such as BeautifulSoup. In a web page, email IDs usually appear in the href attribute of an anchor <a> tag, written using the mailto URL scheme. Many a time, an email address may also be present in the page content as plain text (without any hyperlink). In this chapter, we shall use the BeautifulSoup library to fetch email IDs from an HTML page with simple techniques.

A typical usage of an email ID in an href attribute is as below −

<a href="mailto:[email protected]">test link</a>

In the first example, we shall consider the following HTML document for extracting the email IDs from the hyperlinks −

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
         <li><a href="mailto:[email protected]">Sales Enquiries</a></li>
         <li><a href="mailto:[email protected]">Careers</a></li>
         <li><a href="mailto:[email protected]">Partner with us</a></li>
      </ul>
   </body>
</html>

Here is the Python code that finds the email IDs. We collect all the <a> tags in the document and check whether each tag has an href attribute. If its value starts with "mailto:", the part of the value after the first seven characters is the email ID.

from bs4 import BeautifulSoup

fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all("a")

for tag in tags:
   if tag.has_attr("href") and tag['href'][:7] == "mailto:":
      print (tag['href'][7:])

For the given HTML document, the email IDs will be extracted as follows −

[email protected]
[email protected]
[email protected]

In the second example, we assume that the email IDs appear anywhere in the text. To extract them, we use the regex searching mechanism. A regex is a character pattern; Python's re module helps in processing regex (regular expression) patterns.
The following regex pattern is used for searching the email address −

pat = r"[\w.+-]+@[\w-]+\.[\w.-]+"

For this exercise, we shall use the following HTML document, having email IDs in <li> tags.

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
         <li>Sales Enquiries: [email protected]</li>
         <li>Careers: [email protected]</li>
         <li>Partner with us: [email protected]</li>
      </ul>
   </body>
</html>

Using the email regex, we find the occurrences of the pattern in each <li> tag string. Here is the Python code −

Example

from bs4 import BeautifulSoup
import re

def isemail(s):
   pat = r"[\w.+-]+@[\w-]+\.[\w.-]+"
   grp = re.findall(pat, s)
   return grp

fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all('li')

for tag in tags:
   emails = isemail(tag.string)
   if emails:
      print (emails)

Output

['[email protected]']
['[email protected]']
['[email protected]']

Using the simple techniques described above, we can use BeautifulSoup to extract email IDs from web pages.
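The regex itself needs no BeautifulSoup; a quick way to sanity-check the pattern is to run re.findall() directly over a string. The addresses below are hypothetical example.com placeholders:

```python
import re

# Same pattern as above: local part, "@", domain label, dot, rest of domain
pat = r"[\w.+-]+@[\w-]+\.[\w.-]+"

text = "Sales: sales@example.com, Careers: hr-team@jobs.example.org"
print(re.findall(pat, text))
# ['sales@example.com', 'hr-team@jobs.example.org']
```

Because the character classes exclude spaces, colons and commas, the surrounding punctuation is not swallowed into the match.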

Beautiful Soup – Pretty Printing

To display the entire parsed tree of an HTML document or the contents of a specific tag, you can use the print() function or call the str() function.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Hello World</h1>", "lxml")
print ("Tree:", soup)
print ("h1 tag:", str(soup.h1))

Output

Tree: <html><body><h1>Hello World</h1></body></html>
h1 tag: <h1>Hello World</h1>

The str() function returns a string encoded in UTF-8. To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that each tag is on its own separate line with indentation, allowing you to easily visualize the structure of the parse tree.

Consider the following HTML string.

<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method, we can better understand its structure −

html = """
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print (soup.prettify())

Output

<html>
 <body>
  <p>
   The quick,
   <b>
    brown fox
   </b>
   jumps over a lazy dog.
  </p>
 </body>
</html>

You can call prettify() on any of the Tag objects in the document.

print (soup.b.prettify())

Output

<b>
 brown fox
</b>

The prettify() method is for understanding the structure of the document. However, it should not be used to reformat it, as it adds whitespace (in the form of newlines) and thereby changes the meaning of an HTML document. The prettify() method can optionally be given a formatter argument to specify the formatting to be used.
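As a sketch of the formatter argument, the built-in "html" formatter converts characters to named HTML entities where one exists, while the default "minimal" formatter only escapes &, <, and > and leaves other characters as Unicode (html.parser is used here to avoid a dependency on lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9 &amp; bar</p>", "html.parser")

# Default ("minimal") formatter: é stays as a Unicode character
print(soup.prettify())

# "html" formatter: é is substituted with the named entity &eacute;
print(soup.prettify(formatter="html"))
```

Choosing a formatter is therefore a trade-off between readable Unicode output and maximally portable ASCII-only markup.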

Beautiful Soup – Get Tag Position

The Tag object in Beautiful Soup possesses two useful properties that give information about its position in the HTML document. They are −

sourceline − the line number at which the tag is found.

sourcepos − the starting index of the tag within the line in which it is found.

These properties are supported by html.parser, which is Python's built-in parser, and by the html5lib parser. They are not available when you are using the lxml parser.

In the following example, an HTML string is parsed with html.parser and we find the line number and position of the <p> tags in the HTML string.

Example

html = """
<html>
<body>
<p>Web frameworks</p>
<ul>
   <li>Django</li>
   <li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
   <li>Tkinter</li>
   <li>PyQt</li>
</ol>
</body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')
for p in p_tags:
   print (p.sourceline, p.sourcepos, p.string)

Output

4 0 Web frameworks
9 0 GUI frameworks

For html.parser, these numbers represent the position of the initial less-than sign, which is 0 in this example. It is slightly different when the html5lib parser is used.

Example

html = """
<html>
<body>
<p>Web frameworks</p>
<ul>
   <li>Django</li>
   <li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
   <li>Tkinter</li>
   <li>PyQt</li>
</ol>
</body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')

li_tags = soup.find_all('li')
for l in li_tags:
   print (l.sourceline, l.sourcepos, l.string)

Output

6 3 Django
7 3 Flask
11 3 Tkinter
12 3 PyQt

When using html5lib, the sourcepos property returns the position of the final greater-than sign.

Beautiful Soup – Copying Objects

To create a copy of any tag or NavigableString, use the copy() function from the copy module in Python's standard library.

Example

from bs4 import BeautifulSoup
import copy

markup = "<p>Learn <b>Python, Java</b>, <i>advanced Python and advanced Java</i>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")

i1 = soup.find('i')
icopy = copy.copy(i1)
print (icopy)

Output

<i>advanced Python and advanced Java</i>

Although the two copies (the original and the copied one) contain the same markup, they do not represent the same object.

print (i1 == icopy)
print (i1 is icopy)

Output

True
False

The copied object is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it.

print (icopy.parent)

Output

None
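Because the copy is detached, it can be modified or inserted elsewhere without touching the original tree. A small sketch, using a simplified version of the markup above:

```python
import copy
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Learn <i>Python</i></p>", "html.parser")
i1 = soup.i
icopy = copy.copy(i1)

# Changing the copy leaves the original tree untouched
icopy.string = "Java"
print(soup.p)     # <p>Learn <i>Python</i></p>
print(icopy)      # <i>Java</i>
```

This is the main reason to copy instead of moving a tag: appending the original i1 to another tree would remove it from soup, while appending icopy would not.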

Beautiful Soup – Selecting nth Child

HTML is characterized by the hierarchical order of tags. For example, the <html> tag encloses the <body> tag, inside which there may be a <div> tag, which further may have <ul> and <li> elements nested respectively.

The findChildren() method returns a ResultSet (a list subclass) of all the child tags directly under an element, while the .children property returns an iterator over them. By traversing either, you can obtain the child located at a desired position − the nth child.

The code below uses the children property of a <div> tag in the HTML document. Since the return type of the children property is an iterator, we build a Python list from it, filtering out the whitespace NavigableString nodes between the tags. Once done, we can fetch the desired child. Here the child element with index 1 of the <div> tag is displayed.

Example

from bs4 import BeautifulSoup, NavigableString

markup = """
<div id="Languages">
   <p>Java</p>
   <p>Python</p>
   <p>C++</p>
</div>
"""

soup = BeautifulSoup(markup, 'html.parser')

tag = soup.div
children = tag.children
childlist = [child for child in children if not isinstance(child, NavigableString)]
print (childlist[1])

Output

<p>Python</p>

To use the findChildren() method instead of the children property, change the statement to

children = tag.findChildren()

There will be no change in the output.

A more efficient approach to locating the nth child is the select() method. The select() method uses CSS selectors to obtain the required PageElements from the current element. The Soup and Tag objects support CSS selectors through their .css property, which is an interface to the CSS selector API. The selector implementation is handled by the Soup Sieve package, which is installed along with the bs4 package. Soup Sieve defines different types of CSS selectors − simple, compound and complex selectors made up of one or more type selectors, ID selectors and class selectors, as defined in the CSS language.
Soup Sieve supports pseudo-class selectors as well. A CSS pseudo-class is a keyword added to a selector that specifies a special state of the selected element(s). We shall use the :nth-child pseudo-class selector in this example. Since we need to select the child of the <div> tag at the 2nd position, we pass :nth-child(2) to the select_one() method.

Example

from bs4 import BeautifulSoup

markup = """
<div id="Languages">
   <p>Java</p>
   <p>Python</p>
   <p>C++</p>
</div>
"""

soup = BeautifulSoup(markup, 'html.parser')

tag = soup.div
child = tag.select_one(':nth-child(2)')
print (child)

Output

<p>Python</p>

We get the same result as with the findChildren() method. Note that the child numbering starts at 1, not at 0 as in a Python list.
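Soup Sieve also understands the richer nth-* forms from CSS. A short sketch, reusing the same <div> markup as above:

```python
from bs4 import BeautifulSoup

markup = """
<div id="Languages">
   <p>Java</p>
   <p>Python</p>
   <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# Last child of the <div>
print(div.select_one(":last-child"))      # <p>C++</p>

# Every odd-numbered child (positions 1 and 3)
print(div.select("p:nth-child(odd)"))     # [<p>Java</p>, <p>C++</p>]
```

Note that select() returns a ResultSet of all matches, while select_one() returns only the first match in document order.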