Beautiful Soup – Extract Email IDs




Extracting Email addresses from a web page is an important application of a web scraping library such as BeautifulSoup. In a web page, Email IDs usually appear in the href attribute of an anchor <a> tag, written using the mailto URL scheme. Often, an Email address may also be present in the page content as normal text (without any hyperlink). In this chapter, we shall use the BeautifulSoup library to fetch Email IDs from an HTML page with simple techniques.

A typical usage of an Email ID in the href attribute is as below −


<a href = "mailto:[email protected]">test link</a>

In the first example, we shall consider the following HTML document for extracting the Email IDs from the hyperlinks −


<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li><a href = "mailto:[email protected]">Sales Enquiries</a></li>
      <li><a href = "mailto:[email protected]">Careers</a></li>
      <li><a href = "mailto:[email protected]">Partner with us</a></li>
      </ul>
   </body>
</html>

Here's the Python code that finds the Email IDs. We collect all the <a> tags in the document and check whether each tag has an href attribute whose value starts with mailto:. If so, the part of the value after the 7th character (the length of the mailto: prefix) is the Email ID.


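from bs4 import BeautifulSoup

# Parse the sample document saved as contact.html
with open("contact.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

# Collect every anchor tag, then keep those whose href uses the mailto scheme
tags = soup.find_all("a")
for tag in tags:
   if tag.has_attr("href") and tag["href"].startswith("mailto:"):
      # "mailto:" is 7 characters long, so the address starts at index 7
      print(tag["href"][7:])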

For the given HTML document, the Email IDs will be extracted as follows −


[email protected]
[email protected]
[email protected]
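The same idea can be sketched with a CSS attribute selector, which lets BeautifulSoup pick out only the mailto links. This is a minimal sketch, not the tutorial's own method: the function name extract_mailto_addresses and the example.com addresses are hypothetical, and it assumes each link holds a single address, optionally followed by query parameters such as ?subject=.

```python
from urllib.parse import unquote

from bs4 import BeautifulSoup


def extract_mailto_addresses(html):
    """Return the addresses found in mailto: links (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    addresses = []
    # a[href^="mailto:"] selects anchors whose href starts with "mailto:"
    for tag in soup.select('a[href^="mailto:"]'):
        target = tag["href"][len("mailto:"):]
        # Drop optional parameters such as ?subject=..., then decode %-escapes
        address = unquote(target.split("?", 1)[0]).strip()
        if address:
            addresses.append(address)
    return addresses


# Placeholder document with made-up addresses
sample = """
<ul>
  <li><a href="mailto:sales@example.com?subject=Hello">Sales</a></li>
  <li><a href="https://example.com">Home</a></li>
  <li><a href="mailto:jobs@example.com">Careers</a></li>
</ul>
"""
print(extract_mailto_addresses(sample))
```

Links that use other schemes, such as the https link above, are skipped by the selector itself, so no separate has_attr check is needed.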

In the second example, we assume that the Email IDs may appear anywhere in the text. To extract them, we use regular expression (regex) searching. A regex is a character pattern that describes a set of strings. Python's re module provides functions for working with regex patterns. The following regex pattern is used for searching for an email address −


pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
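To see how such a pattern behaves before applying it to a page, it can be tried on a plain string. The sketch below assumes the pattern uses the \w character class and an escaped dot (\.); the helper name find_emails and the sample addresses are hypothetical.

```python
import re

# [\w.+-]+  local part: letters, digits, underscore, dot, plus, hyphen
# @         literal separator
# [\w-]+    first label of the domain
# \.        a literal dot
# [\w.-]+   the rest of the domain, which may contain further dots
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_emails(text):
    """Return every substring of text that matches the pattern."""
    return EMAIL_PATTERN.findall(text)


sample = "Write to sales@example.com or first.last+tag@mail.example.org today"
print(find_emails(sample))
```

The pattern is deliberately loose. Because the last character class includes the dot, an address that ends a sentence (for example "sales@example.com.") is matched together with the trailing full stop, so the results may need trimming.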

For this exercise, we shall use the following HTML document, having Email IDs in <li> tags.


<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li>Sales Enquiries: [email protected]</li>
      <li>Careers: [email protected]</li>
      <li>Partner with us: [email protected]</li>
      </ul>
   </body>
</html>

Using the email regex, we'll find occurrences of the pattern in the text of each <li> tag. Here is the Python code −

Example


from bs4 import BeautifulSoup
import re

def isemail(s):
   # Return every substring of s that looks like an email address
   pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
   grp = re.findall(pat, s)
   return grp

with open("contact.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('li')

for tag in tags:
   # get_text() always returns a string, whereas tag.string can be None
   emails = isemail(tag.get_text())
   if emails:
      print(emails)

Output


['[email protected]']
['[email protected]']
['[email protected]']
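When the addresses are not confined to <li> tags, the same regex can be run over the text of the whole page. The following is a rough sketch under a few assumptions: the function name emails_in_page and the example.com addresses are hypothetical, duplicates are dropped while keeping first-seen order, and a trailing full stop picked up by the pattern is trimmed.

```python
import re

from bs4 import BeautifulSoup

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def emails_in_page(html):
    """Return the unique email-like strings found in the page text."""
    soup = BeautifulSoup(html, "html.parser")
    # Join all text nodes with a space so that text from adjacent tags
    # does not run together into one word
    text = soup.get_text(" ")
    found = []
    for match in EMAIL_PATTERN.findall(text):
        email = match.rstrip(".")  # trim a sentence-ending full stop
        if email not in found:
            found.append(email)
    return found


# Placeholder document with made-up addresses
sample = """
<ul>
  <li>Sales: <b>sales@example.com</b></li>
  <li>Careers: jobs@example.com</li>
  <li>Sales again: sales@example.com.</li>
</ul>
"""
print(emails_in_page(sample))
```

Searching the combined text also covers tags with nested children, for which tag.string would return None.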

Using the simple techniques described above, we can use BeautifulSoup to extract Email IDs from web pages.
