user, Author at Donotsad where can learn any thing work project and make money

Aug 09

Beautiful Soup – next_element Property

Method Description

In the Beautiful Soup library, the next_element property returns the Tag or NavigableString that was parsed immediately after the current PageElement, even if that object lies outside the current element's parent. There is also a next property, which behaves in a similar way.

Syntax

    Element.next_element

Return value

The next_element and next properties return the Tag or NavigableString that appears immediately after the current element in document order.

Example 1

In the document tree parsed from the given HTML string, we find the next_element of the <b> tag.

    html = '''
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    tag = soup.b
    print(tag)
    nxt = tag.next_element
    print("Next:", nxt)
    nxt = tag.next_element.next_element
    print("Next:", nxt)

Output

    <b>Excellent</b>
    Next: Excellent
    Next: <p>Python</p>

The output may look strange at first: the next element of <b>Excellent</b> is reported as "Excellent". This is because the string inside the tag is parsed right after the tag itself, so it is registered as the next element. To reach <p>Python</p>, fetch the next_element property of that inner NavigableString object.

Example 2

PageElement objects also support the next property, which is analogous to next_element.

    html = '''
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    tag = soup.b
    print(tag)
    nxt = tag.next
    print("Next:", nxt)
    nxt = tag.next.next
    print("Next:", nxt)

Output

    <b>Excellent</b>
    Next: Excellent
    Next: <p>Python</p>

Example 3

In the next example, we determine the element that follows the <body> tag. Because <body> is immediately followed by a line break (\n), we take the next element of that line break, which turns out to be the <h1> tag.

    from bs4 import BeautifulSoup

    fp = open("index.html")
    soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.find('body')
    nxt = tag.next_element.next
    print("Next:", nxt)

Output

    Next: <h1>TutorialsPoint</h1>
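Since next_element often lands on a string rather than a tag, it can help to keep following the chain until a Tag is reached. The following is only a minimal sketch of that idea: the helper name next_tag and the sample HTML are assumptions made for illustration, and html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup invented for this sketch.
html = '<p><b>Excellent</b></p><p>Python</p>'
soup = BeautifulSoup(html, 'html.parser')


def next_tag(element):
    """Follow next_element until a Tag is found; return None at the end of the document."""
    node = element.next_element
    while node is not None and not isinstance(node, Tag):
        node = node.next_element
    return node


# The element parsed right after <b> is its inner string.
first = soup.b.next_element
# Skipping over strings reaches the next tag in document order.
following = next_tag(soup.b)

print(type(first).__name__, repr(str(first)))
print(following)
```

The loop stops at the first Tag, so whitespace strings between tags (such as the line break after <body> in Example 3) are skipped automatically.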

Aug 09

Beautiful Soup – find_parents Method

Method Description

The find_parents() method in the BeautifulSoup package finds all parents of this element that match the given criteria.

Syntax

    find_parents(name, attrs, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_parents() method returns a ResultSet of all matching parent elements, ordered from the nearest parent outward to the top of the document.

Example 1

We shall use the following HTML script in this example:

    <html>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul id="dept">
    <li>Accounts</li>
    <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>
    </ul>
    </body>
    </html>

Printing the name of every parent of the first <li> tag gives:

Output

    ul
    body
    html
    [document]

Note that the name property of the BeautifulSoup object always returns [document].

Example 2

In this example, the limit argument is passed to the find_parents() method to restrict the parent search to two levels up.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    obj = soup.find('li')
    parents = obj.find_parents(limit=2)
    for parent in parents:
        print(parent.name)

Output

    ul
    body
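To make the effect of the name and limit arguments concrete, here is a small hedged sketch. It uses a shortened version of the list markup above, inlined as a string (an assumption, so the snippet runs without an index.html file), and starts from a nested <li> rather than the first one.

```python
from bs4 import BeautifulSoup

# Shortened copy of the department list, inlined for this sketch.
html = """
<html><body>
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
</ul>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
item = soup.find('li', string='Anand')

# No filter: every parent, nearest first, ending with the BeautifulSoup object.
all_parents = [parent.name for parent in item.find_parents()]

# Name filter: only the enclosing <ul> tags are returned.
list_ids = [parent.get('id') for parent in item.find_parents('ul')]

# limit=2: stop after the two nearest parents.
nearest_two = [parent.name for parent in item.find_parents(limit=2)]

print(all_parents)
print(list_ids)
print(nearest_two)
```

Starting from the nested item shows that the result is ordered from the innermost parent outward, which is what the "reverse order" in the return type refers to.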

Aug 09

Beautiful Soup – Web Scraping

Scraping is the process of extracting, copying and screening data. When the data or feeds are extracted from the web (from web pages or websites), the process is called web scraping. Web scraping, also known as web data extraction or web harvesting, gives developers a way to collect and analyze data from the internet.

Why Web-scraping?

Web scraping is a useful tool for automating much of what a person does while browsing. Enterprises use it in a variety of ways:

Data for Research
Analysts such as researchers and journalists use web scrapers instead of manually collecting and cleaning data from websites.

Products, prices & popularity comparison
Several services use web scrapers to collect data from numerous online sites and compare product popularity and prices.

SEO Monitoring
Numerous SEO tools, such as Ahrefs, Seobility and SEMrush, are used for competitive analysis and for pulling data from your client's websites.

Search engines
Some big IT companies run businesses that depend entirely on web scraping.

Sales and Marketing
Marketers can use scraped data to analyze different niches and competitors, and sales specialists can use it to sell content marketing or social media promotion services.

Why Python for Web Scraping?

Python is one of the most popular languages for web scraping because it handles most web crawling tasks easily. Some reasons to choose Python for web scraping are given below:

Ease of Use
Most developers agree that Python is easy to code in. It does not require curly braces "{ }" or semicolons ";", which makes web scrapers more readable and easier to develop.

Huge Library Support
Python provides a large set of libraries for different requirements, so it is suitable for web scraping as well as for data visualization, machine learning, etc.

Easily Explicable Syntax
Python is a very readable language with a syntax that is easy to understand. It is expressive, and code indentation helps users distinguish the different blocks or scopes in the code.

Dynamically-typed language
Python is a dynamically typed language: the type of a variable is determined by the value assigned to it, so no type declarations are needed. This saves time when writing code.

Huge Community
Python has a huge community that can help wherever you get stuck while writing code.

Aug 09

Beautiful Soup – Searching the Tree

In this chapter, we shall discuss different methods in Beautiful Soup for navigating the HTML document tree in different directions: going up and down, sideways, and back and forth.

We shall use the following HTML string in all the examples in this chapter:

    html = """
    <html><head><title>TutorialsPoint</title></head>
    <body>
    <p class="title"><b>Online Tutorials Library</b></p>
    <p class="story">TutorialsPoint has an excellent collection of tutorials on:
    <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
    <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
    <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
    Enhance your Programming skills.</p>
    <p class="tutorial">…</p>
    """

The name of the required tag lets you navigate the parse tree. For example, soup.head fetches the <head> element:

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    print(soup.head.prettify())

Output

    <head>
     <title>
      TutorialsPoint
     </title>
    </head>

Going down

A tag may contain strings or other tags enclosed in it. The .contents property of a Tag object returns a list of all its direct children.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.head
    print(tag.contents)

Output

    [<title>TutorialsPoint</title>]

The returned object is a list, although in this case there is only a single child tag enclosed in the head element.

.children

The .children property gives access to the same enclosed elements, but as an iterator rather than a list. Below, the children of the body tag are collected into a list.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.body
    print(list(tag.children))

Output

    ['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n', <p class="story">TutorialsPoint has an excellent collection of tutorials on: <a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>, <a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and <a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>; Enhance your Programming skills.</p>, '\n', <p class="tutorial">…</p>, '\n']

Instead of building a list, you can iterate over a tag's children directly with .children:

Example

    tag = soup.body
    for child in tag.children:
        print(child)

Output

    <p class="title"><b>Online Tutorials Library</b></p>
    <p class="story">TutorialsPoint has an excellent collection of tutorials on: <a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>, <a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and <a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>; Enhance your Programming skills.</p>
    <p class="tutorial">…</p>

.descendants

The .contents and .children attributes only consider a tag's direct children. The .descendants attribute lets you iterate over all of a tag's children recursively: its direct children, the children of its direct children, and so on.

The BeautifulSoup object is at the top of the hierarchy of all the tags. Hence its .descendants property includes all the elements in the HTML string.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    print(soup.descendants)

The .descendants attribute returns a generator, so printing it shows only the generator object; to see the elements, iterate over it with a for loop. Here, we list the descendants of the head tag.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.head
    for element in tag.descendants:
        print(element)

Output

    <title>TutorialsPoint</title>
    TutorialsPoint

The head tag contains a title tag, which in turn encloses the NavigableString object TutorialsPoint. The <head> tag therefore has only one child but two descendants: the <title> tag and the <title> tag's child. In the same way, the BeautifulSoup object has very few direct children (essentially the <html> tag), but many descendants.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tags = list(soup.descendants)
    print(len(tags))

Output

    27

Going Up

Just as you navigate down a document with the children and descendants properties, BeautifulSoup offers the .parent and .parents properties to navigate up from a tag.

.parent

Every tag and every string has a parent tag that contains it. You can access an element's parent with the parent attribute. In our example, the <head> tag is the parent of the <title> tag.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.title
    print(tag.parent)

Output

    <head><title>TutorialsPoint</title></head>

Since the title tag contains a string (NavigableString), the parent of that string is the title tag itself.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.title
    string = tag.string
    print(string.parent)

Output

    <title>TutorialsPoint</title>

.parents

You can iterate over all of an element's parents with .parents. In the following code, we track the parents of the first <a> tag in the example HTML string, travelling from deep within the document to its very top.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.a
    print(tag.string)
    for parent in tag.parents:
        print(parent.name)

Output

    Python
    p
    body
    html
    [document]

Sideways

Elements that share the same parent are called siblings. Consider the following HTML snippet:

    <p>
       <b>
          Hello
       </b>
       <i>
          Python
       </i>
    </p>

Inside the outer <p> tag, the <b> and <i> tags are at the same level, hence they are siblings. BeautifulSoup makes it possible to navigate between tags at the same level.

.next_sibling and .previous_sibling

These attributes respectively return the next element at the same level and the previous element at the same level.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag1 = soup.b
    print("next:", tag1.next_sibling)
    tag2 = soup.i
    print("previous:", tag2.previous_sibling)

Output

    next: <i>Python</i>
    previous: <b>Hello</b>

The <b> tag has no sibling to its left and the <i> tag has no sibling to its right, so previous_sibling of <b> and next_sibling of <i> both return None.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag1 = soup.b
    print
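The navigation properties discussed in this chapter can be compared side by side on one small document. This is only an illustrative sketch: the markup (the sibling snippet wrapped in an extra <div>) and the variable names are assumptions chosen for the comparison.

```python
from bs4 import BeautifulSoup

# The sibling snippet wrapped in a <div>, invented for this comparison.
html = '<div><p><b>Hello</b><i>Python</i></p></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

# Going down: direct children versus all descendants (tags and strings).
child_names = [child.name for child in p.children]
descendant_count = len(list(p.descendants))

# Going up: parents are listed from the nearest one outward.
parent_names = [parent.name for parent in soup.b.parents]

# Sideways: <b> has a sibling on its right but none on its left.
next_of_b = soup.b.next_sibling
previous_of_b = soup.b.previous_sibling

print(child_names, descendant_count)
print(parent_names)
print(next_of_b, previous_of_b)
```

Here <p> has two children but four descendants, because each of the two tags also contributes its inner string.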

Aug 09

Beautiful Soup – Find Elements by Class

CSS (Cascading Style Sheets) is a tool for designing the appearance of HTML elements. CSS rules control different aspects of an HTML element such as size, color and alignment. Applying styles is more effective than defining HTML element attributes. Instead of applying a style to each element individually, CSS classes are used to apply similar styling to groups of HTML elements and achieve a uniform web page appearance. In BeautifulSoup, it is possible to find tags styled with a CSS class. In this chapter, we shall use the following methods to search for elements with a specified CSS class:

find_all() and find() methods
select() and select_one() methods

Class in CSS

A class in CSS is a collection of attributes specifying different features related to appearance, such as font type, size and color, background color, alignment etc. The name of the class is prefixed with a dot (.) when declaring it.

    .class {
       css declarations;
    }

A CSS class may be defined inline, or in a separate CSS file which needs to be included in the HTML script. A typical example of a CSS class could be as follows:

    .blue-text {
       color: blue;
       font-weight: bold;
    }

You can search for HTML elements that use a certain class with the help of the following BeautifulSoup methods.

For the purpose of this chapter, we shall use the following HTML page:

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <h2 class="heading">Departmentwise Employees</h2>
    <ul>
    <li class="mainmenu">Accounts</li>
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="mainmenu">HR</li>
    <ul>
    <li class="submenu">Rani</li>
    <li class="submenu">Ankita</li>
    </ul>
    </ul>
    </body>
    </html>

Using find() and find_all()

To search for elements that use a certain CSS class, pass the attrs argument to find_all() as follows:

Example

    from bs4 import BeautifulSoup

    fp = open("index.html")
    soup = BeautifulSoup(fp, 'html.parser')
    obj = soup.find_all(attrs={"class": "mainmenu"})
    print(obj)

Output

    [<li class="mainmenu">Accounts</li>, <li class="mainmenu">HR</li>]

The result is a list of all the elements with the mainmenu class.

To fetch the elements that have any of several CSS classes, pass a list of class names in attrs by changing the find_all() statement to:

    obj = soup.find_all(attrs={"class": ["mainmenu", "submenu"]})

This results in a list of all the elements with any of the CSS classes used above.

    [
    <li class="mainmenu">Accounts</li>,
    <li class="submenu">Anand</li>,
    <li class="submenu">Mahesh</li>,
    <li class="mainmenu">HR</li>,
    <li class="submenu">Rani</li>,
    <li class="submenu">Ankita</li>
    ]

Using select() and select_one()

You can also use the select() method with a CSS selector as the argument. The dot (.) symbol followed by the name of the class is used as the CSS selector.

Example

    from bs4 import BeautifulSoup

    fp = open("index.html")
    soup = BeautifulSoup(fp, 'html.parser')
    obj = soup.select(".heading")
    print(obj)

Output

    [<h2 class="heading">Departmentwise Employees</h2>]

The select_one() method returns the first element found with the given class.

    obj = soup.select_one(".submenu")
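The different ways of searching by class can be checked against each other on a small inline document. The sketch below assumes a shortened version of the menu markup and a hypothetical helper named texts; it also uses the class_ keyword argument of find_all(), which is an alternative to the attrs dictionary shown above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the menu markup, inlined for this sketch.
html = """
<ul>
<li class="mainmenu">Accounts</li>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
<li class="mainmenu">HR</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Return the text of each tag so that results are easy to compare."""
    return [tag.get_text() for tag in tags]


# Three equivalent ways to find elements with the "submenu" class.
by_attrs = texts(soup.find_all(attrs={"class": "submenu"}))
by_keyword = texts(soup.find_all(class_="submenu"))
by_selector = texts(soup.select("li.submenu"))

# A list of class names matches elements having any one of them.
either = texts(soup.find_all(class_=["mainmenu", "submenu"]))

# select_one() returns only the first match.
first_main = soup.select_one(".mainmenu").get_text()

print(by_attrs, by_keyword, by_selector)
print(either, first_main)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.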

Aug 09

Beautiful Soup – Home

Beautiful Soup Tutorial

In this tutorial, we will show you how to perform web scraping in Python using Beautiful Soup 4 to get data out of HTML, XML and other markup languages. We will try to scrape web pages from various websites (including IMDB). We will cover Beautiful Soup 4 and the basic Python tools for efficiently and clearly navigating, searching and parsing HTML web pages. We have tried to cover almost all the functionality of Beautiful Soup 4 in this tutorial. You can combine several of the features introduced here into one bigger program that captures meaningful data from a website and passes it to another sub-program as input.

Audience

This tutorial is designed to guide you in scraping a web page. The basic aim is to get meaningful data out of a huge, unorganized set of data. The target audience of this tutorial is:

Anyone who wants to know how to scrape web pages in Python using BeautifulSoup.
Any data science developer or enthusiast who wants to feed this scraped (meaningful) data into Python data science libraries to make better decisions.

Prerequisites

There is no mandatory requirement for this tutorial. However, prior knowledge of any of the technologies mentioned below will be an added advantage:

Knowledge of any web-related technologies (HTML/CSS/Document Object Model etc.).
Python language (as Beautiful Soup is a Python package).
Prior experience of scraping in any language.
Basic understanding of the HTML tree structure.

Aug 09

Beautiful Soup – Scraping List from HTML

Web pages often contain important data in the form of ordered or unordered lists. With Beautiful Soup, we can easily extract the HTML list elements and bring the data into Python objects, ready to be stored in databases for further analysis. In this chapter, we shall use the find() and select() methods to scrape list data from an HTML document.

The easiest way to search a parse tree is to search for a tag by its name. soup.<tag> fetches the contents of the given tag.

HTML provides the <ol> and <ul> tags to compose ordered and unordered lists. Like any other tag, we can fetch the contents of these tags.

We shall use the following HTML document:

    <html>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul id="dept">
    <li>Accounts</li>
    <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>
    </ul>
    </body>
    </html>

Scraping lists by Tag

In the above HTML document, we have a top-level <ul> list, inside which there is another <ul> tag and an <ol> tag. We first parse the document into a soup object and retrieve the contents of the first <ul> through the soup.ul Tag object.

Example

    from bs4 import BeautifulSoup

    fp = open('index.html')
    soup = BeautifulSoup(fp, 'html.parser')
    lst = soup.ul
    print(lst)

Output

    <ul id="dept">
    <li>Accounts</li>
    <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>
    </ul>

Change the value of lst to point to the <ol> element to get the inner list.

    lst = soup.ol

Output

    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>

Using select() method

The select() method is essentially used to obtain data using a CSS selector. A plain tag name is also a valid selector, so here we can pass the ol tag to the select() method. The select_one() method is also available. It fetches the first occurrence of the given tag.

Example

    from bs4 import BeautifulSoup

    fp = open('index.html')
    soup = BeautifulSoup(fp, 'html.parser')
    lst = soup.select("ol")
    print(lst)

Output

    [<ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>]

Using find_all() method

The find() and find_all() methods are more comprehensive. You can pass various types of filters, such as a tag name, attributes or a string, to these methods. In this case, we want to fetch the contents of a list tag.

In the following code, the find_all() method returns a list of all the <ul> elements in the document.

Example

    from bs4 import BeautifulSoup

    fp = open('index.html')
    soup = BeautifulSoup(fp, 'html.parser')
    lst = soup.find_all("ul")
    print(lst)

We can refine the search filter by including the attrs argument. In our HTML document, the <ul> and <ol> tags have their respective id attributes specified. So, let us fetch the contents of the <ul> element having id="acc".

Example

    from bs4 import BeautifulSoup

    fp = open('index.html')
    soup = BeautifulSoup(fp, 'html.parser')
    lst = soup.find_all("ul", {"id": "acc"})
    print(lst)

Output

    [<ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>]

Here is another example. We collect all <li> elements whose inner text starts with 'A'. The find_all() method takes a keyword argument string, which here is given a function; an element is included if the startingwith() function returns True for its text.

Example

    from bs4 import BeautifulSoup

    def startingwith(ch):
        return ch.startswith('A')

    fp = open('index.html')
    soup = BeautifulSoup(fp, 'html.parser')
    lst = soup.find_all('li', string=startingwith)
    print(lst)

Output

    [<li>Accounts</li>, <li>Anand</li>, <li>Ankita</li>]
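To bring the list data into Python objects, as described at the start of this chapter, the department headings can be paired with the nested list that follows each of them. The following is a rough sketch under a few assumptions: the markup is inlined as a string instead of being read from index.html, and the names departments and starts_with_a are hypothetical.

```python
from bs4 import BeautifulSoup

# The department list from this chapter, inlined so the sketch needs no file.
html = """
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
top = soup.find("ul", id="dept")

departments = {}
# recursive=False keeps only the <li> tags directly under the top-level list.
for heading in top.find_all("li", recursive=False):
    # Each department's members are in the next <ul> or <ol> at the same level.
    sublist = heading.find_next_sibling(["ul", "ol"])
    members = [li.get_text(strip=True) for li in sublist.find_all("li")] if sublist else []
    departments[heading.get_text(strip=True)] = members

# A string filter that also tolerates tags without a single string (None).
starts_with_a = [
    li.get_text()
    for li in soup.find_all("li", string=lambda s: s is not None and s.startswith("A"))
]

print(departments)
print(starts_with_a)
```

The resulting dictionary maps each department to its employees and can be passed on to a database layer or a data analysis library.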

Aug 09

Beautiful Soup – Scrape HTML Content

The process of extracting data from websites is called web scraping. A web page may have URLs, email addresses, images or any other content, which can be stored in a file or database. Searching a website manually is a cumbersome process. There are different web scraping tools that automate the process.

A site may restrict web scraping through its robots.txt file. Some popular sites provide APIs to access their data in a structured way. Unethical web scraping may result in getting your IP blocked.

Python is widely used for web scraping. The Python standard library has the urllib package, which can be used to extract data from HTML pages. Since urllib is bundled with the standard library, it need not be installed.

The urllib package is an HTTP client for the Python programming language. The urllib.request module is useful when we want to open and read URLs. The other modules in the urllib package are:

urllib.error defines the exceptions and errors raised by the urllib.request module.
urllib.parse is used for parsing URLs.
urllib.robotparser is used for parsing robots.txt files.

Use the urlopen() function in the urllib.request module to read the content of a web page from a website.

    import urllib.request

    response = urllib.request.urlopen('http://python.org/')
    html = response.read()

You can also use the requests library for this purpose. You need to install it before using it.

    pip3 install requests

In the code below, the homepage of https://www.tutorialspoint.com is fetched:

    from bs4 import BeautifulSoup
    import requests

    url = "https://www.tutorialspoint.com/index.htm"
    req = requests.get(url)

The content obtained by either of the above two methods is then parsed with Beautiful Soup.
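Because a robots.txt file can restrict scraping, it is reasonable to check it before fetching a page. The sketch below uses urllib.robotparser from the standard library; the robots.txt content and the example.com URLs are made up for illustration and are parsed from a string, so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Create a RobotFileParser from robots.txt text instead of a URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# can_fetch(useragent, url) reports whether the rules allow the request.
allowed_public = parser.can_fetch("*", "https://example.com/index.htm")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.htm")

print(allowed_public, allowed_private)
```

For a live site, the parser would normally be pointed at the site's robots.txt with set_url() and loaded with read() before calling can_fetch().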

Aug 09

Beautiful Soup – Navigating by Tags

Tags are among the most important elements of any HTML document, and they may contain other tags or strings (the tag's children). Beautiful Soup provides different ways to navigate and iterate over a tag's children.

The easiest way to search a parse tree is to search for a tag by its name.

soup.head

The soup.head attribute returns the <head> .. </head> element of an HTML page, together with its contents.

Consider the following HTML page to be scraped:

    <html>
    <head>
    <title>TutorialsPoint</title>
    <script>
    document.write("Welcome to TutorialsPoint");
    </script>
    </head>
    <body>
    <h1>Tutorialspoint Online Library</h1>
    <p><b>It's all Free</b></p>
    </body>
    </html>

The following code extracts the contents of the <head> element.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    print(soup.head)

Output

    <head>
    <title>TutorialsPoint</title>
    <script>
    document.write("Welcome to TutorialsPoint");
    </script>
    </head>

soup.body

Similarly, to return the contents of the body part of the HTML page, use soup.body.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    print(soup.body)

Output

    <body>
    <h1>Tutorialspoint Online Library</h1>
    <p><b>It's all Free</b></p>
    </body>

You can also extract a specific tag (such as the first <h1> tag) inside the <body> tag.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    print(soup.body.h1)

Output

    <h1>Tutorialspoint Online Library</h1>

soup.p

Our HTML file contains a <p> tag. We can extract the contents of this tag.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    print(soup.p)

Output

    <p><b>It's all Free</b></p>

Tag.contents

A Tag object may have one or more PageElements. The Tag object's contents property returns a list of all elements included in it.

Let us find the elements in the <head> tag of our index.html file.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.head
    print(tag.contents)

Output

    ['\n', <title>TutorialsPoint</title>, '\n', <script>
    document.write("Welcome to TutorialsPoint");
    </script>, '\n']

Tag.children

The structure of tags in an HTML script is hierarchical. The elements are nested one inside the other. For example, the top-level <HTML> tag includes the <HEAD> and <BODY> tags, each of which may have other tags in it.

The Tag object has a children property that returns an iterator over the enclosed PageElements.

To demonstrate the children property, we shall use the following HTML script (index.html). In the <body> section, there is a top-level <ul> list that contains two nested <ul> lists, one under each top-level list item.

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul>
    <li>Accounts</li>
    <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>
    </ul>
    </body>
    </html>

The following Python code gives a list of all the children of the top-level <ul> tag.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.ul
    print(list(tag.children))

Output

    ['\n', <li>Accounts</li>, '\n', <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>, '\n', <li>HR</li>, '\n', <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>, '\n']

Since the .children property returns an iterator, we can use a for loop to traverse the children.

Example

    for child in tag.children:
        print(child)

Output

    <li>Accounts</li>
    <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>

Tag.find_all()

This method returns a result set of all the tags that match the tag name provided as the argument.

Let us consider the following HTML page (index.html) for this:

    <html>
    <body>
    <h1>Tutorialspoint Online Library</h1>
    <p><b>It's all Free</b></p>
    <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
    <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
    <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
    <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
    <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
    </body>
    </html>

The following code lists all the <a> elements.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    result = soup.find_all("a")
    print(result)

Output

    [
    <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
    <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
    <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
    <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
    <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
    ]
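Once find_all() has returned the matching tags, their attributes and text can be read like dictionary entries. The following is a small sketch built on a shortened, inlined copy of the page above (an assumption, so that no index.html file is needed); the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Two of the links from the page above, inlined for this sketch.
html = """
<body>
<h1>Tutorialspoint Online Library</h1>
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute values are read with dictionary-style access on each tag.
links = {a["id"]: a["href"] for a in soup.find_all("a")}

# Tag names can be chained to reach a nested tag.
heading = soup.body.h1.get_text()

# .children also yields the line-break strings, so keep only Tag objects.
child_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]

print(links)
print(heading)
print(child_tags)
```

Filtering with isinstance(child, Tag) avoids the '\n' entries that appear in the .contents and .children outputs shown earlier.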

Aug 09

Beautiful Soup – Find Elements by Attribute

Both the find() and find_all() methods are meant to find one or all the tags in the document that match the arguments passed to them. You can pass the attrs parameter to these methods. The value of attrs must be a dictionary with one or more tag attributes and their values.

To check the behaviour of these methods, we shall use the following HTML document (index.html):

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <form>
    <input type = 'text' id = 'nm' name = 'name'>
    <input type = 'text' id = 'age' name = 'age'>
    <input type = 'text' id = 'marks' name = 'marks'>
    </form>
    </body>
    </html>

Using find_all()

The following program returns a list of all the tags having the type="text" attribute.

Example

    from bs4 import BeautifulSoup

    fp = open("index.html")
    soup = BeautifulSoup(fp, 'html.parser')
    obj = soup.find_all(attrs={"type": "text"})
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using find()

The find() method returns the first tag in the parsed document that has the given attributes.

    obj = soup.find(attrs={"name": "marks"})

Using select()

The select() method can be called with a CSS attribute selector, in which the attribute to be matched is written inside square brackets. It returns a list of all tags that have the given attribute.

In the following code, the select() method returns all the tags with a type attribute.

Example

    from bs4 import BeautifulSoup

    fp = open("index.html")
    soup = BeautifulSoup(fp, 'html.parser')
    obj = soup.select("[type]")
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using select_one()

The select_one() method is similar, except that it returns only the first tag satisfying the given filter.

    obj = soup.select_one('[name="marks"]')

Output

    <input id="marks" name="marks" type="text"/>
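The dictionary filter and the CSS attribute selector can be compared directly on a small form. This sketch assumes a variation of the form above in which one input is a checkbox (a change made here only so that the filters return different results); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Variation of the form above; the checkbox input is added for this sketch.
html = """
<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
<input type="checkbox" id="ok" name="agree">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary filter: tags whose type attribute equals "text".
text_ids = [tag["id"] for tag in soup.find_all(attrs={"type": "text"})]

# The same filter written as a CSS attribute selector.
selected_ids = [tag["id"] for tag in soup.select('input[type="text"]')]

# "[type]" matches every tag that has a type attribute, whatever its value.
with_type_count = len(soup.select("[type]"))

# attrs is the safe way to filter on the HTML attribute called "name",
# because the first argument of find() is already the tag name.
age = soup.find(attrs={"name": "age"})
age_attrs = age.attrs

print(text_ids, selected_ids, with_type_count)
print(age_attrs)
```

The attrs property of a tag returns all of its attributes as a dictionary, which is convenient when the whole set of attributes is needed rather than a single value.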