Processing Images and Videos

Web scraping usually involves downloading, storing and processing web media content. In this chapter, let us understand how to process the content downloaded from the web.

Introduction

The web media content that we obtain during scraping can consist of image, audio and video files, delivered as non-web pages as well as data files. But can we trust the downloaded data, in particular the extension of the file we are about to download and store in our computer memory? This makes it essential to know the actual type of the data we are going to store locally.

Getting Media Content from Web Page

In this section, we are going to learn how to download media content that correctly represents the media type, based on the information from the web server. We can do it with the help of the Python requests module, as we did in the previous chapter.

First, we need to import the necessary Python modules as follows −

import requests

Now, provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080×180.jpg"

Use the following code to create an HTTP response object.

r = requests.get(url)

With the help of the following lines of code, we can save the received content as a .png file.

with open("ThinkBig.png", "wb") as f:
   f.write(r.content)

After running the above Python script, we will get a file named ThinkBig.png, which contains the downloaded image.

Extracting Filename from URL

After downloading content from a website, we may also want to save it in a file whose name is taken from the URL. We should also check whether additional fragments exist in the URL. For this, we need to find the actual filename from the URL. The following Python script, using urlparse, extracts the filename from the URL −

from urllib.parse import urlparse
import os

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080×180.jpg"
a = urlparse(url)
a.path

You can observe the output as shown below −

'/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080×180.jpg'

os.path.basename(a.path)

You can observe the output as shown below −

'MetaSlider_ThinkBig-1080×180.jpg'

Once you run the above script, you will get the filename from the URL.

Information about Type of Content from URL

While extracting content from a web server with a GET request, we can also check the information provided by the web server about that content. With the help of the following Python script, we can determine what the web server reports as the type of the content −

First, we need to import the necessary Python modules as follows −

import requests

Now, we need to provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080×180.jpg"

The following line of code will create an HTTP response object.

r = requests.get(url, allow_redirects=True)

Now, we can see what kind of information about the content is provided by the web server.
for headers in r.headers:
   print(headers)

You can observe the output as shown below −

Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type

With the help of the following line of code, we can get a particular header, say Content-Type −

print (r.headers.get('content-type'))

You can observe the output as shown below −

image/jpeg

With the help of the following line of code, we can get another header, say ETag −

print (r.headers.get('ETag'))

You can observe the output as shown below −

None

Observe the following command −

print (r.headers.get('content-length'))

You can observe the output as shown below −

12636

With the help of the following line of code, we can get the Server header −

print (r.headers.get('Server'))

You can observe the output as shown below −

Apache

Generating Thumbnail for Images

A thumbnail is a very small description or representation of an image. A user may want to save only the thumbnail of a large image, or save both the image and the thumbnail. In this section we are going to create a thumbnail of the image named ThinkBig.png downloaded in the previous section "Getting Media Content from Web Page".

For this Python script, we need to install the Python library named Pillow, a fork of the Python Imaging Library, which has useful functions for manipulating images. It can be installed with the help of the following command −

pip install pillow

The following Python script will create a thumbnail of the image and will save it to the current directory, prefixing the thumbnail file name with Th_ −

import glob
from PIL import Image

for infile in glob.glob("ThinkBig.png"):
   img = Image.open(infile)
   img.thumbnail((128, 128), Image.ANTIALIAS)
   if infile[0:3] != "Th_":
      img.save("Th_" + infile, "png")

The above code is very easy to understand, and you can check for the thumbnail file in the current directory.

Screenshot from Website

In web scraping, a very common task is to take a screenshot of a website. For implementing this, we are going to use selenium and webdriver. The following Python script will take a screenshot of a website and save it to the current directory.

from selenium import webdriver

path = r"C:\Users\gaurav\Desktop\Chromedriver"
browser = webdriver.Chrome(executable_path = path)
browser.get('https://tutorialspoint.com/')
screenshot = browser.save_screenshot('screenshot.png')
browser.quit()

You can observe the output as shown below −

DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-a571-892dc4c90eb7

After running the script, you can check your current directory for the screenshot.png file.

Thumbnail Generation for Video

Suppose we have downloaded videos from a website and want to generate thumbnails for them, so that a specific video can be clicked based on its thumbnail. For generating thumbnails for videos we need a simple tool called ffmpeg, which can be downloaded from www.ffmpeg.org. After downloading, we need to install it as per the specifications of our OS.
The following Python script will generate a thumbnail of the video and will save it to our local directory −

import subprocess

video_MP4_file = r"C:\Users\gaurav\Desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'

subprocess.call(['ffmpeg', '-i', video_MP4_file, '-ss', '00:00:20.000', '-vframes', '1', thumbnail_image_file, '-y'])

After running the above script, we will get the thumbnail named thumbnail_solar_video.jpg saved in our local directory.

Ripping an MP4 video to an MP3

Suppose you have downloaded a video file from a website, but you need only its audio track as an MP3 file.
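The excerpt breaks off at this point. As a minimal sketch, assuming the same ffmpeg tool installed above, the audio could be extracted as follows; the input and output file names are placeholder assumptions −

import subprocess

# Placeholder file names -- adjust them to the video you actually downloaded.
video_MP4_file = r"C:\Users\gaurav\Desktop\solar.mp4"
audio_MP3_file = "solar_audio.mp3"

# -vn drops the video stream; -acodec libmp3lame encodes the audio as MP3.
subprocess.call(['ffmpeg', '-i', video_MP4_file, '-vn', '-acodec', 'libmp3lame', audio_MP3_file, '-y'])

After running this sketch, the extracted audio would be saved as solar_audio.mp3 in the local directory.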

Data Processing

In earlier chapters, we learned about extracting data from web pages, or web scraping, by using various Python modules. In this chapter, let us look into various techniques to process the data that has been scraped.

Introduction

To process the data that has been scraped, we must store the data on our local machine in a particular format like a spreadsheet (CSV), JSON, or sometimes in a database like MySQL.

CSV and JSON Data Processing

First, we are going to write the information grabbed from a web page into a CSV file or a spreadsheet. Let us first understand this through a simple example in which we grab the information using the BeautifulSoup module, as we did earlier, and then use the Python CSV module to write that textual information into a CSV file.

First, we need to import the necessary Python libraries as follows −

import requests
from bs4 import BeautifulSoup
import csv

In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/.

r = requests.get('https://authoraditiagarwal.com/')

Now, we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')

Now, with the help of the next lines of code, we will write the grabbed data into a CSV file named dataprocessing.csv.

f = csv.writer(open('dataprocessing.csv', 'w'))
f.writerow(['Title'])
f.writerow([soup.title.text])

After running this script, the textual information, that is the title of the web page, will be saved in the above mentioned CSV file on your local machine.

Similarly, we can save the collected information in a JSON file. The following is an easy to understand Python script for doing the same, in which we grab the same information as in the last Python script, but this time the grabbed information is saved in JSONFile.txt by using the json Python module.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
with open('JSONFile.txt', 'wt') as outfile:
   json.dump(soup.title.text, outfile)

After running this script, the grabbed information, i.e. the title of the web page, will be saved in the above mentioned text file on your local machine.

Data Processing using AWS S3

Sometimes we may want to save scraped data in our local storage for archival purposes. But what if we need to store and analyze this data at a massive scale? The answer is the cloud storage service named Amazon S3, or AWS S3 (Simple Storage Service). Basically, AWS S3 is an object storage which is built to store and retrieve any amount of data from anywhere.

We can follow these steps for storing data in AWS S3 −

Step 1 − First we need an AWS account, which will provide us the secret keys to use in our Python script while storing the data. It will let us create an S3 bucket in which we can store our data.

Step 2 − Next, we need to install the boto3 Python library for accessing the S3 bucket. It can be installed with the help of the following command −

pip install boto3

Step 3 − Next, we can use the following Python script for scraping data from a web page and saving it to an AWS S3 bucket.

First, we need to import the Python libraries for scraping; here we are working with requests for scraping, and boto3 for saving data to the S3 bucket.

import requests
import boto3

Now we can scrape the data from our URL.
data = requests.get("Enter the URL").text

Now, for storing data to the S3 bucket, we need to create an S3 client as follows −

s3 = boto3.client('s3')
bucket_name = "our-content"

The next lines of code will create the S3 bucket and upload the data −

s3.create_bucket(Bucket = bucket_name, ACL = 'public-read')
s3.put_object(Bucket = bucket_name, Key = '', Body = data, ACL = "public-read")

Now you can check the bucket named our-content from your AWS account.

Data Processing using MySQL

Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link https://www.tutorialspoint.com/mysql/. With the help of the following steps, we can scrape and process data into a MySQL table −

Step 1 − First, by using MySQL we need to create the database and table in which we want to save our scraped data. For example, we are creating the table with the following query −

CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000),
PRIMARY KEY(id));

Step 2 − Next, we need to deal with Unicode. Note that MySQL does not handle Unicode by default. We need to turn on this feature with the help of the following commands, which change the default character set for the database, for the table and for both of the columns −

ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL, which can be installed with the help of the following command −

pip install PyMySQL

Step 4 − Now, our database named scrap, created earlier, is ready to save the data scraped from the web into the table named Scrap_pages. Here in our example we are going to scrape data from Wikipedia and save it into our database.

First, we need to import the required Python modules.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re

Now, make a connection, that is, integrate this with Python.

conn = pymysql.connect(host='127.0.0.1', user='root', passwd = None, db = 'mysql', charset = 'utf8')
cur = conn.cursor()
cur.execute("USE scrap")
random.seed(datetime.datetime.now())

def store(title, content):
   cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)', (title, content))
   cur.connection.commit()

Now, connect with Wikipedia and get data from it.

def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org' + articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
   store(title, content)
   return
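The excerpt ends with the scraping function. As a minimal, illustrative way to exercise these two functions and release the database connection, the script could end as follows; the starting article path is just an example and not part of the original text −

try:
   getLinks('/wiki/Python_(programming_language)')
finally:
   cur.close()
   conn.close()

After running it, the title and first paragraph of the chosen article would be stored in the scrap_pages table.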

Python Web Scraping – Resources

The following resources contain additional information on Python Web Scraping. Please use them to get more in-depth knowledge on this topic.

Useful Video Courses

Scrapy Course: Python Web Scraping & Crawling for Beginners − 28 Lectures, 3.5 hours − Attreya Bhatt
Web Scraping using Excel VBA − 20 Lectures, 2 hours − Kamal Kishor Girdher
Web Scraping using API, Beautiful Soup using Python − 39 Lectures, 3.5 hours − Chandramouli Jayendran
Web Scraping for Data Science – Python & Selenium – Basics − 28 Lectures, 3 hours − AlexanderSchlee
Web Scraping APIs for Data Science 2021 | PostgreSQL+Excel − 21 Lectures, 4 hours − AlexanderSchlee
Web Scraping with Beautiful Soup for Data Science − 22 Lectures, 4 hours − AlexanderSchlee

Testing with Scrapers

This chapter explains how to perform testing using web scrapers in Python.

Introduction

In large web projects, automated testing of the website's backend is performed regularly, but frontend testing is often skipped. The main reason behind this is that the programming of websites is like a net of various markup and programming languages. We can write unit tests for one language, but it becomes challenging if the interaction is being done in another language. That is why we must have a suite of tests to make sure that our code is performing as per our expectations.

Testing using Python

When we talk about testing, we mean unit testing. Before diving deep into testing with Python, we must know about unit testing. Following are some of the characteristics of unit testing −

At least one aspect of the functionality of a component is tested in each unit test.

Each unit test is independent and can also run independently.

A unit test does not interfere with the success or failure of any other test.

Unit tests can run in any order and must contain at least one assertion.

Unittest − Python Module

The Python module named unittest comes with every standard Python installation. We just need to import it, and the rest is the task of the unittest.TestCase class, which does the following −

setUp and tearDown functions are provided by the unittest.TestCase class. These functions can run before and after each unit test.

It also provides assert statements to allow tests to pass or fail.

It runs all the functions that begin with test_ as unit tests.

Example

In this example we are going to combine web scraping with unittest. We will test the Wikipedia page for the search string 'Python'. It basically does two tests: the first checks whether the page title is the same as the search string, i.e. 'Python', and the second makes sure that the page has a content div.

First, we will import the required Python modules. We are using BeautifulSoup for web scraping and, of course, unittest for testing.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest

Now we need to define a class which extends unittest.TestCase. The global object bs is shared between all tests; the unittest-specified function setUpClass accomplishes this. Here we define two functions, one for testing the page title and the other for testing the page content.

class Test(unittest.TestCase):
   bs = None
   def setUpClass():
      url = 'https://en.wikipedia.org/wiki/Python'
      Test.bs = BeautifulSoup(urlopen(url), 'html.parser')
   def test_titleText(self):
      pageTitle = Test.bs.find('h1').get_text()
      self.assertEqual('Python', pageTitle)
   def test_contentExists(self):
      content = Test.bs.find('div', {'id':'mw-content-text'})
      self.assertIsNotNone(content)

if __name__ == '__main__':
   unittest.main()

After running the above script we will get the following output −

----------------------------------------------------------------------
Ran 2 tests in 2.773s

OK
An exception has occurred, use %tb to see the full traceback.
SystemExit: False
D:\ProgramData\lib\site-packages\IPython\core\interactiveshell.py:2870:
UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

Testing with Selenium

Let us discuss how to use Python Selenium for testing. It is also called Selenium testing. Both Python unittest and Selenium do not have much in common.
We know that Selenium sends standard Python commands to different browsers, despite variations in their design. Recall that we already installed and worked with Selenium in previous chapters. Here we will create test scripts in Selenium and use them for automation.

Example

With the help of the next Python script, we create a test script for the automation of the Facebook login page. You can modify the example for automating other forms and logins of your choice; however, the concept would be the same.

First, for connecting to the web browser, we will import webdriver from the selenium module −

from selenium import webdriver

Now, we need to import Keys from the selenium module.

from selenium.webdriver.common.keys import Keys

Next, we need to provide the username and password for logging into our Facebook account.

user = "[email protected]"
pwd = ""

Next, provide the path to the web driver for Chrome.

path = r"C:\Users\gaurav\Desktop\Chromedriver"
driver = webdriver.Chrome(executable_path=path)
driver.get("http://www.facebook.com")

Now we will verify the conditions by using the assert keyword.

assert "Facebook" in driver.title

With the help of the following lines of code, we send a value to the email section. Here we are searching for it by its id, but we could also search by name, as driver.find_element_by_name("email").

element = driver.find_element_by_id("email")
element.send_keys(user)

With the help of the following lines of code, we send a value to the password section. Here we are searching for it by its id, but we could also search by name, as driver.find_element_by_name("pass").

element = driver.find_element_by_id("pass")
element.send_keys(pwd)

The next line of code is used to press enter/login after inserting the values in the email and password fields.

element.send_keys(Keys.RETURN)

Now we will close the browser.

driver.close()

After running the above script, the Chrome web browser will be opened and you can see the email and password being inserted and the login button clicked.

Comparison: unittest or Selenium

The comparison of unittest and Selenium is difficult: if you want to work with large test suites, the syntactical rigidity of unittest is required. On the other hand, if you are going to test website flexibility, then a Selenium test would be our first choice. But what if we can combine both of them? We can import Selenium into Python unittest and get the best of both. Selenium can be used to get information about a website, and unittest can evaluate whether that information meets the criteria for passing the test or not.

For example, we rewrite the above Python script for automation of the Facebook login by combining both of them as follows −

import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class InputFormsCheck(unittest.TestCase):
   def setUp(self):
      self.driver = webdriver.Chrome(r"C:\Users\gaurav\Desktop\chromedriver")
   def test_singleInputField(self):
      user = "[email protected]"
      pwd = ""
      pageUrl = "http://www.facebook.com"
      driver = self.driver
      driver.maximize_window()
      driver.get(pageUrl)
      assert "Facebook" in driver.title
      elem = driver.find_element_by_id("email")
      elem.send_keys(user)
      elem = driver.find_element_by_id("pass")
      elem.send_keys(pwd)
      elem.send_keys(Keys.RETURN)
   def tearDown(self):
      self.driver.close()

if __name__ == "__main__":
   unittest.main()

Python Web Scraping – Quick Guide

Python Web Scraping – Introduction

Web scraping is an automatic process of extracting information from the web. This chapter will give you an in-depth idea of web scraping, its comparison with web crawling, and why you should opt for web scraping. You will also learn about the components and working of a web scraper.

What is Web Scraping?

The dictionary meaning of the word 'scraping' implies getting something from the web. Here two questions arise: what can we get from the web, and how can we get it?

The answer to the first question is 'data'. Data is indispensable for any programmer, and the basic requirement of every programming project is a large amount of useful data.

The answer to the second question is a bit tricky, because there are lots of ways to get data. In general, we may get data from a database, a data file or other sources. But what if we need a large amount of data that is available online? One way to get such data is to manually search (clicking away in a web browser) and save (copy-pasting into a spreadsheet or file) the required data. This method is quite tedious and time consuming. Another way to get such data is by using web scraping.

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.

Origin of Web Scraping

The origin of web scraping is screen scraping, which was used to integrate non-web based applications or native Windows applications. Originally screen scraping was used prior to the wide use of the World Wide Web (WWW), but it could not scale up as the WWW expanded. This made it necessary to automate the approach of screen scraping, and the technique called 'web scraping' came into existence.

Web Crawling v/s Web Scraping

The terms web crawling and web scraping are often used interchangeably, as the basic concept of both is to extract data. However, they are different from each other. We can understand the basic difference from their definitions. Web crawling is basically used to index the information on a page using bots, also known as crawlers; it is also called indexing. On the other hand, web scraping is an automated way of extracting information using bots, also known as scrapers; it is also called data extraction. To understand the difference between these two terms, let us look at the comparison given below −

Web Crawling | Web Scraping
Refers to downloading and storing the contents of a large number of websites. | Refers to extracting individual data elements from a website by using its site-specific structure.
Mostly done on a large scale. | Can be implemented at any scale.
Yields generic information. | Yields specific information.
Used by major search engines like Google, Bing and Yahoo; Googlebot is an example of a web crawler. | The extracted information can be replicated on some other website or used to perform data analysis; for example, the data elements can be names, addresses, prices, etc.

Uses of Web Scraping

The uses of and reasons for using web scraping are as endless as the uses of the World Wide Web. Web scrapers can do anything, such as ordering food online, scanning an online shopping website for you, or buying tickets for a match the moment they become available,
just like a human can. Some of the important uses of web scraping are discussed here −

E-commerce Websites − Web scrapers can collect data related to the price of a specific product from various e-commerce websites for comparison.

Content Aggregators − Web scraping is used widely by content aggregators such as news aggregators and job aggregators for providing updated data to their users.

Marketing and Sales Campaigns − Web scrapers can be used to get data like e-mail addresses, phone numbers etc. for sales and marketing campaigns.

Search Engine Optimization (SEO) − Web scraping is widely used by SEO tools like SEMrush, Majestic etc. to tell businesses how they rank for the search keywords that matter to them.

Data for Machine Learning Projects − The retrieval of data for machine learning projects often depends upon web scraping.

Data for Research − Researchers can collect useful data for their research work, saving time through this automated process.

Components of a Web Scraper

A web scraper consists of the following components −

Web Crawler Module − A very necessary component of a web scraper, the web crawler module is used to navigate the target website by making HTTP or HTTPS requests to the URLs. The crawler downloads the unstructured data (HTML contents) and passes it to the extractor, the next module.

Extractor − The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called a parser module and uses different parsing techniques such as regular expressions, HTML parsing, DOM parsing or artificial intelligence.

Data Transformation and Cleaning Module − The data extracted above is not suitable for ready use. It must pass through a cleaning module so that we can use it. Methods like string manipulation or regular expressions can be used for this purpose. Note that extraction and transformation can also be performed in a single step.

Storage Module − After extracting the data, we need to store it as per our requirement. The storage module will output the data in a standard format that can be stored in a database or in JSON or CSV format.

Working of a Web Scraper

A web scraper may be defined as a software or script used to download the contents of multiple web pages and extract data from them. We can understand the working of a web scraper in simple steps: download the contents of the target pages, extract the data of interest, transform and clean it, and store it in the required format. A minimal sketch of these components is shown below.
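The following short Python script is an illustrative sketch, not part of the original tutorial's examples. It wires the four components described above together, assuming requests and BeautifulSoup are installed; the target URL, the choice of h2 headings and the output file name are placeholder assumptions −

import csv
import requests
from bs4 import BeautifulSoup

def crawl(url):
   # Web crawler module: download the unstructured HTML content.
   return requests.get(url).text

def extract(html):
   # Extractor module: parse the HTML into semi-structured data.
   soup = BeautifulSoup(html, 'lxml')
   return [h.get_text() for h in soup.find_all('h2')]

def clean(items):
   # Data transformation and cleaning module: simple string manipulation.
   return [item.strip() for item in items if item.strip()]

def store(items, path='headings.csv'):
   # Storage module: write the cleaned data in a standard format (CSV).
   with open(path, 'w', newline='') as f:
      writer = csv.writer(f)
      for item in items:
         writer.writerow([item])

store(clean(extract(crawl('https://authoraditiagarwal.com/'))))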

Dealing with Text

In the previous chapter, we saw how to deal with the videos and images that we obtain as part of web scraping content. In this chapter we are going to deal with text analysis by using a Python library and will learn about this in detail.

Introduction

You can perform text analysis by using the Python library called the Natural Language Toolkit (NLTK). Before proceeding into the concepts of NLTK, let us understand the relation between text analysis and web scraping. Analyzing the words in a text can lead us to know which words are important, which words are unusual, and how words are grouped. This analysis eases the task of web scraping.

Getting started with NLTK

The Natural Language Toolkit (NLTK) is a collection of Python libraries which is designed especially for identifying and tagging parts of speech found in the text of a natural language like English.

Installing NLTK

You can use the following command to install NLTK in Python −

pip install nltk

If you are using Anaconda, then a conda package for NLTK can be installed by using the following command −

conda install -c anaconda nltk

Downloading NLTK's Data

After installing NLTK, we have to download the preset text repositories. But before downloading the text preset repositories, we need to import NLTK with the help of the import command as follows −

import nltk

Now, with the help of the following command, NLTK data can be downloaded −

nltk.download()

Installation of all available packages of NLTK will take some time, but it is always recommended to install all the packages.

Installing Other Necessary Packages

We also need some other Python packages, like gensim and pattern, for doing text analysis as well as building natural language processing applications by using NLTK.

gensim − A robust semantic modeling library which is useful for many applications. It can be installed with the following command −

pip install gensim

pattern − Used to make the gensim package work properly. It can be installed with the following command −

pip install pattern

Tokenization

The process of breaking the given text into smaller units called tokens is called tokenization. These tokens can be words, numbers or punctuation marks. It is also called word segmentation.

Example

The NLTK module provides different packages for tokenization. We can use these packages as per our requirement. Some of the packages are described here −

sent_tokenize package − This package will divide the input text into sentences. You can use the following command to import this package −

from nltk.tokenize import sent_tokenize

word_tokenize package − This package will divide the input text into words. You can use the following command to import this package −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package − This package will divide the input text as well as the punctuation marks into words. You can use the following command to import this package −

from nltk.tokenize import WordPunctTokenizer

Stemming

In any language, there are different forms of a word. A language includes lots of variations due to grammatical reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as well as web scraping projects, it is important for machines to understand that these different words have the same base form. Hence it can be useful to extract the base forms of the words while analyzing the text.
This can be achieved by stemming, which may be defined as the heuristic process of extracting the base forms of words by chopping off their ends.

The NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some of these packages are described here −

PorterStemmer package − Porter's algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.porter import PorterStemmer

For example, after giving the word 'writing' as the input to this stemmer, the output would be the word 'write' after stemming.

LancasterStemmer package − Lancaster's algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.lancaster import LancasterStemmer

For example, after giving the word 'writing' as the input to this stemmer, the output would be the word 'writ' after stemming.

SnowballStemmer package − Snowball's algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.snowball import SnowballStemmer

For example, after giving the word 'writing' as the input to this stemmer, the output would be the word 'write' after stemming.

Lemmatization

Another way to extract the base form of words is lemmatization, which normally aims to remove inflectional endings by using vocabulary and morphological analysis. The base form of any word after lemmatization is called a lemma.

The NLTK module provides the following package for lemmatization −

WordNetLemmatizer package − It will extract the base form of the word depending upon whether it is used as a noun or as a verb. You can use the following command to import this package −

from nltk.stem import WordNetLemmatizer

Chunking

Chunking, which means dividing the data into small chunks, is one of the important processes in natural language processing for identifying parts of speech and short phrases like noun phrases. Chunking does the labeling of tokens, and we can get the structure of a sentence with the help of the chunking process.

Example

In this example, we are going to implement noun-phrase chunking by using the NLTK Python module. NP chunking is a category of chunking which finds the noun phrase chunks in a sentence.

Steps for implementing noun phrase chunking

We need to follow the steps given below for implementing noun-phrase chunking −

Step 1 − Chunk grammar definition. In the first step we define the grammar for chunking, which consists of the rules that the chunker must follow.
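The excerpt stops at the grammar definition. A minimal sketch of the remaining steps, assuming NLTK with its 'punkt' and 'averaged_perceptron_tagger' data packages is installed and using an illustrative example sentence, might look like this −

import nltk

sentence = "A clever fox was jumping over the lazy dog"
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging

# Chunk grammar: an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
result = parser.parse(tagged)
print(result)

Running this sketch prints a tree in which the noun phrases of the sentence are grouped under NP nodes.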

Python Modules for Web Scraping

In this chapter, let us learn about various Python modules that we can use for web scraping.

Python Development Environments using virtualenv

Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all the necessary executables to use the packages that our Python project requires. It also allows us to add and modify Python modules without access to the global installation.

You can use the following command to install virtualenv −

(base) D:\ProgramData>pip install virtualenv
Collecting virtualenv
Downloading virtualenv-16.0.0-py2.py3-none-any.whl (1.9MB)
Installing collected packages: virtualenv
Successfully installed virtualenv-16.0.0

Now, we need to create a directory which will represent the project, with the help of the following command −

(base) D:\ProgramData>mkdir webscrap

Now, enter that directory with the help of the following command −

(base) D:\ProgramData>cd webscrap

Now, we need to initialize the virtual environment folder of our choice as follows −

(base) D:\ProgramData\webscrap>virtualenv websc
Using base prefix 'd:\programdata'
New python executable in D:\ProgramData\webscrap\websc\Scripts\python.exe
Installing setuptools, pip, wheel...done.

Now, activate the virtual environment with the command given below. Once it is successfully activated, you will see its name on the left hand side in brackets.

(base) D:\ProgramData\webscrap>websc\scripts\activate

We can install any module in this environment as follows −

(websc) (base) D:\ProgramData\webscrap>pip install requests
Collecting requests
Downloading requests-2.19.1-py2.py3-none-any.whl (91kB)
Collecting chardet<3.1.0,>=3.0.2 (from requests)
Collecting certifi>=2017.4.17 (from requests)
Collecting urllib3<1.24,>=1.21.1 (from requests)
Collecting idna<2.8,>=2.5 (from requests)
Installing collected packages: chardet, certifi, urllib3, idna, requests
Successfully installed certifi-2018.8.24 chardet-3.0.4 idna-2.7 requests-2.19.1 urllib3-1.23

For deactivating the virtual environment, we can use the following command −

(websc) (base) D:\ProgramData\webscrap>deactivate
(base) D:\ProgramData\webscrap>

You can see that (websc) has been deactivated.
Python Modules for Web Scraping

Web scraping is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement. In this section, we are going to discuss useful Python libraries for web scraping.

Requests

Requests is a simple and efficient Python HTTP library used for accessing web pages, and it is widely used in web scraping. With the help of Requests, we can get the raw HTML of web pages, which can then be parsed to retrieve the data. Before using requests, let us understand its installation.

Installing Requests

We can install it either in our virtual environment or in the global installation. With the help of the pip command, we can easily install it as follows −

(base) D:\ProgramData>pip install requests
Collecting requests
Using cached requests-2.19.1-py2.py3-none-any.whl
Requirement already satisfied: idna<2.8,>=2.5 in d:\programdata\lib\site-packages (from requests) (2.6)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in d:\programdata\lib\site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\lib\site-packages (from requests) (2018.1.18)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in d:\programdata\lib\site-packages (from requests) (3.0.4)
Installing collected packages: requests
Successfully installed requests-2.19.1

Example

In this example, we are making a GET HTTP request for a web page. For this we need to first import the requests library as follows −

In [1]: import requests

In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/.

In [2]: r = requests.get('https://authoraditiagarwal.com/')

Now we can retrieve the content by using the .text property as follows −

In [5]: r.text[:200]

Observe that in the following output, we got the first 200 characters.

Out[5]: '<!DOCTYPE html>\n<html lang="en-US"\n\titemscope \n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#" >\n<head>\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'

Urllib3

It is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more about it in its technical documentation at https://urllib3.readthedocs.io/en/latest/.

Installing Urllib3

Using the pip command, we can install urllib3 either in our virtual environment or in the global installation.

(base) D:\ProgramData>pip install urllib3
Collecting urllib3
Using cached urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23

Example: Scraping using Urllib3 and BeautifulSoup

In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of the requests library to get the raw data (HTML) from the web page, and then we use BeautifulSoup for parsing that HTML data.
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')
print (soup.title)
print (soup.title.text)

This is the output you will observe when you run this code −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

Selenium

It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. We have Selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using Selenium and its Python bindings. You can learn more about Selenium with Java at the link Selenium.

Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The currently supported Python versions are 2.7, 3.5 and above.

Installing Selenium

Using the pip command, we can install selenium either in our virtual environment or in the global installation.

pip install selenium

As Selenium requires a driver to interface with the chosen browser, we need to download it. The following list shows different browsers and their links for downloading the drivers −

Chrome − https://sites.google.com/a/chromium.org/
Edge − https://developer.microsoft.com/
Firefox − https://github.com/
Safari − https://webkit.org/

Example

This example shows web scraping using Selenium. It can also be used for testing, which is called Selenium testing. After downloading the particular driver for the specified version of browser, we need to do programming in Python. First, we need to import webdriver from selenium as follows −

from selenium import webdriver
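The chapter's excerpt ends at the import. A minimal sketch of the remaining steps, in the style of the earlier examples, might look as follows; the chromedriver path is a placeholder, and the executable_path argument matches the older Selenium releases used throughout this tutorial −

# Placeholder path to the downloaded browser driver; adjust for your system.
path = r"C:\Users\gaurav\Desktop\Chromedriver"
browser = webdriver.Chrome(executable_path = path)

# Load a page in the real browser and read data rendered by it.
browser.get('https://authoraditiagarwal.com/')
print (browser.title)
browser.quit()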

Scraping Form based Websites

In the previous chapter, we saw how to scrape dynamic websites. In this chapter, let us understand the scraping of websites that work on user-based inputs, that is, form-based websites.

Introduction

These days the WWW (World Wide Web) is moving towards social media as well as user-generated content. So the question arises: how can we access the kind of information that lies beyond the login screen? For this we need to deal with forms and logins.

In previous chapters, we worked with the HTTP GET method to request information, but in this chapter we will work with the HTTP POST method, which pushes information to a web server for storage and analysis.

Interacting with Login Forms

While working on the Internet, you must have interacted with login forms many times. They may be very simple, including only a few HTML fields, a submit button and an action page, or they may be complicated, with additional fields like email and a message box along with a captcha for security reasons.

In this section, we are going to deal with a simple submit form with the help of the Python requests library.

First, we need to import the requests library as follows −

import requests

Now, we need to provide the information for the fields of the login form.

parameters = {'Name':'Enter your name', 'Email-id':'Your Emailid', 'Message':'Type your message here'}

In the next line of code, we need to provide the URL on which the action of the form would happen.

r = requests.post("enter the URL", data = parameters)
print(r.text)

After running the script, it will return the content of the page where the action has happened.

Suppose you want to submit an image with the form; then it is very easy with requests.post(). You can understand it with the help of the following Python script −

import requests

file = {'Uploadfile': open(r'C:\Users\desktop\123.png', 'rb')}
r = requests.post("enter the URL", files = file)
print(r.text)

Loading Cookies from the Web Server

A cookie, sometimes called a web cookie or internet cookie, is a small piece of data sent from a website which our computer stores in a file located inside our web browser.

In the context of dealing with login forms, cookies can be of two types. One, which we dealt with in the previous section, allows us to submit information to a website; the second lets us remain in a permanent "logged-in" state throughout our visit to the website. For the second kind of form, websites use cookies to keep track of who is logged in and who is not.

What do cookies do?

These days most websites are using cookies for tracking. We can understand the working of cookies with the help of the following steps −

Step 1 − First, the site will authenticate our login credentials and store them in our browser's cookie. This cookie generally contains a server-generated token, a time-out and tracking information.

Step 2 − Next, the website will use the cookie as proof of authentication. This authentication is shown whenever we visit the website.

Cookies are very problematic for web scrapers, because if web scrapers do not keep track of the cookies, the submitted form is sent back and on the next page it seems that they never logged in.
It is very easy to track the cookies with the help of the Python requests library, as shown below −

import requests

parameters = {'Name':'Enter your name', 'Email-id':'Your Emailid', 'Message':'Type your message here'}
r = requests.post("enter the URL", data = parameters)

In the above line of code, the URL would be the page which will act as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

After running the above script, we will retrieve the cookies from the result of the last request.

There is another issue with cookies: sometimes websites frequently modify cookies without warning. Such a situation can be dealt with using requests.Session() as follows −

import requests

session = requests.Session()
parameters = {'Name':'Enter your name', 'Email-id':'Your Emailid', 'Message':'Type your message here'}
r = session.post("enter the URL", data = parameters)

In the above line of code, the URL would be the page which will act as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

Observe that you can easily understand the difference between the script with a session and the one without a session.

Automating forms with Python

In this section we are going to deal with a Python module named Mechanize that will reduce our work and automate the process of filling up forms.

Mechanize module

The Mechanize module provides us a high-level interface to interact with forms. Before using it we need to install it with the following command −

pip install mechanize

Note that it would work only in Python 2.x.

Example

In this example, we are going to automate the process of filling a login form having two fields, namely email and password −

import mechanize

brwsr = mechanize.Browser()
brwsr.open("Enter the URL of login page")
brwsr.select_form(nr = 0)
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()

The above code is very easy to understand. First, we imported the mechanize module. Then a Mechanize browser object is created. Then, we navigate to the login URL and select the form. After that, field names and values are passed directly to the browser object and the form is submitted.
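Returning to the requests.Session() approach shown earlier, the examples above only send the initial POST. A minimal sketch of staying logged in across requests, where the URLs and form field names are placeholder assumptions, might look like this −

import requests

session = requests.Session()

# Placeholder URLs and field names; replace them with those of the target site.
login_url = "https://example.com/login"
profile_url = "https://example.com/profile"
credentials = {'email': 'Enter email', 'password': 'Enter password'}

# The session stores the cookies returned by the login response...
r = session.post(login_url, data = credentials)
print('The cookie is:')
print(session.cookies.get_dict())

# ...and sends them automatically with every later request, so this page is
# fetched as the logged-in user.
r = session.get(profile_url)
print(r.status_code)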

Python Web Scraping – Home

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. This tutorial will teach you various concepts of web scraping and make you comfortable with scraping various types of websites and their data.

Audience

This tutorial will be useful for graduates, post-graduates, and research students who either have an interest in this subject or have this subject as a part of their curriculum. The tutorial suits the learning needs of both beginners and advanced learners.

Prerequisites

The reader must have basic knowledge of HTML, CSS, and JavaScript. He or she should also be aware of the basic terminology used in web technology along with Python programming concepts. If you do not have knowledge of these concepts, we suggest you go through tutorials on them first.

Data Extraction

Analyzing a web page means understanding its structure. Now, the question arises: why is this important for web scraping? In this chapter, let us understand it in detail.

Web Page Analysis

Web page analysis is important because, without analyzing a web page, we cannot know in which form we are going to receive the data (structured or unstructured) from it after extraction. We can do web page analysis in the following ways −

Viewing Page Source

This is a way to understand how a web page is structured by examining its source code. To implement this, we need to right-click the page and select the View page source option. Then, we will get the data of our interest from that web page in the form of HTML. The main concern, however, is the whitespace and formatting, which are difficult for us to read through.

Inspecting Page Source by Clicking Inspect Element Option

This is another way of analyzing a web page. The difference is that it resolves the issue of formatting and whitespace in the source code of the web page. You can implement this by right-clicking and then selecting the Inspect or Inspect element option from the menu. It provides information about a particular area or element of the web page.

Different Ways to Extract Data from Web Page

The following methods are mostly used for extracting data from a web page −

Regular Expression

Regular expressions are a highly specialized programming language embedded in Python. We can use them through the re module of Python. They are also called RE, regexes or regex patterns. With the help of regular expressions, we can specify rules for the possible set of strings we want to match in the data.

If you want to learn more about regular expressions in general, go to the link https://www.tutorialspoint.com/automata_theory/regular_expressions.htm, and if you want to know more about the re module or regular expressions in Python, you can follow the link https://www.tutorialspoint.com/python/python_reg_expressions.htm.

Example

In the following example, we are going to scrape data about India from http://example.webscraping.com after matching the contents of <td> with the help of a regular expression.

import re
import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()
re.findall('<td class="w2p_fw">(.*?)</td>', text)

Output

The corresponding output will be as shown here −

[
   '<img src="/places/static/images/flags/in.png" />',
   '3,287,590 square kilometres',
   '1,173,108,018',
   'IN',
   'India',
   'New Delhi',
   '<a href="/places/default/continent/AS">AS</a>',
   '.in',
   'INR',
   'Rupee',
   '91',
   '######',
   '^(\d{6})$',
   'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
   '<div><a href="/places/default/iso/CN">CN </a><a href="/places/default/iso/NP">NP </a><a href="/places/default/iso/MM">MM </a><a href="/places/default/iso/BT">BT </a><a href="/places/default/iso/PK">PK </a><a href="/places/default/iso/BD">BD </a></div>'
]

Observe that in the above output you can see the details about the country India obtained by using a regular expression.

Beautiful Soup

Suppose we want to collect all the hyperlinks from a web page; then we can use a parser called BeautifulSoup, which is described in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple words, BeautifulSoup is a Python library for pulling data out of HTML and XML files.
It can be used with requests, because it needs an input (a document or URL) to create a soup object, as it cannot fetch a web page by itself. You can use the following Python script to gather the title of the web page; a short sketch for also gathering the hyperlinks appears at the end of this chapter.

Installing Beautiful Soup

Using the pip command, we can install beautifulsoup either in our virtual environment or in the global installation.

(base) D:\ProgramData>pip install bs4
Collecting bs4
Downloading bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
Running setup.py bdist_wheel for bs4 ... done
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

Example

Note that in this example, we are extending the example implemented above with the requests Python module. We are using r.text to create a soup object, which will further be used to fetch details like the title of the web page.

First, we need to import the necessary Python modules −

import requests
from bs4 import BeautifulSoup

In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/.

r = requests.get('https://authoraditiagarwal.com/')

Now we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')
print (soup.title)
print (soup.title.text)

Output

The corresponding output will be as shown here −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

Lxml

Another Python library we are going to discuss for web scraping is lxml. It is a high-performance HTML and XML parsing library. It is comparatively fast and straightforward. You can read more about it at https://lxml.de/.

Installing lxml

Using the pip command, we can install lxml either in our virtual environment or in the global installation.

(base) D:\ProgramData>pip install lxml
Collecting lxml
Downloading lxml-4.2.5-cp36-cp36m-win_amd64.whl (3.6MB)
Installing collected packages: lxml
Successfully installed lxml-4.2.5

Example: Data extraction using lxml and requests

In the following example, we are scraping a particular element of the web page from authoraditiagarwal.com by using lxml and requests −

First, we need to import requests and html from the lxml library as follows −

import requests
from lxml import html

Now we need to provide the URL of the web page to scrape.

url = 'https://authoraditiagarwal.com/leadershipmanagement/'

Now we need to provide the path (XPath) to the particular element of that web page −

path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content())

Output

The corresponding output will be as shown here −

The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate daily progress to the stakeholders. It tracks the completion of work for a given sprint or an iteration. The horizontal axis represents the days within a Sprint. The vertical axis represents the hours remaining to complete the committed work.
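As mentioned above, here is a minimal sketch for also gathering the hyperlinks of the same page with BeautifulSoup; the page is the one used throughout this chapter, and only anchors that actually carry an href attribute are kept −

import requests
from bs4 import BeautifulSoup

r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')

# Collect the href attribute of every anchor tag on the page.
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
for link in links:
   print(link)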