Getting Started with Python

In the first chapter, we learnt what web scraping is all about. In this chapter, let us see how to implement web scraping using Python.

Why Python for Web Scraping?

Python is a popular tool for implementing web scraping. The Python programming language is also used for other useful projects related to cyber security, penetration testing and digital forensic applications. Using only the base installation of Python, web scraping can be performed without any third-party tool.

Python is gaining huge popularity, and the reasons that make it a good fit for web scraping projects are given below −

Syntax Simplicity

Python has one of the simplest syntaxes among programming languages. This makes testing easier and lets a developer focus more on the problem itself.

Inbuilt Modules

Another reason for using Python for web scraping is the inbuilt as well as external libraries it possesses. Many tasks related to web scraping can be performed by using Python as the base for programming.

Open Source Programming Language

Python has huge support from the community because it is an open source programming language.

Wide Range of Applications

Python can be used for various programming tasks, ranging from small shell scripts to enterprise web applications.

Installation of Python

Python distributions are available for platforms like Windows, macOS and Unix/Linux. We need to download only the binary code applicable for our platform to install Python. If the binary code for our platform is not available, we must have a C compiler so that the source code can be compiled manually. We can install Python on various platforms as follows −

Installing Python on Unix and Linux

Follow the steps given below to install Python on Unix/Linux machines −

Step 1 − Go to the link https://www.python.org/downloads/

Step 2 − Download the zipped source code available for Unix/Linux from the above link.

Step 3 − Extract the files onto your computer.

Step 4 − Use the following commands to complete the installation −

run ./configure script
make
make install

You can find the installed Python at the standard location /usr/local/bin and its libraries at /usr/local/lib/pythonXX, where XX is the version of Python.

Installing Python on Windows

Follow the steps given below to install Python on Windows machines −

Step 1 − Go to the link https://www.python.org/downloads/

Step 2 − Download the Windows installer python-XYZ.msi file, where XYZ is the version we need to install.

Step 3 − Now, save the installer file to your local machine and run the MSI file.

Step 4 − At last, run the downloaded file to bring up the Python install wizard.

Installing Python on Macintosh

We can use Homebrew for installing Python 3 on Mac OS X. Homebrew is easy to install and a great package installer. Homebrew itself can be installed by using the following command −

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

For updating the package manager, we can use the following command −

$ brew update

With the help of the following command, we can install Python 3 on our Mac machine −

$ brew install python3

Setting Up the PATH

You can use the following instructions to set up the path in various environments −

Setting Up the Path on Unix/Linux

Use the following commands for setting up paths in various command shells −

For csh shell

setenv PATH "$PATH:/usr/local/bin/python"
For bash shell (Linux)

PATH="$PATH:/usr/local/bin/python"

For sh or ksh shell

PATH="$PATH:/usr/local/bin/python"

Setting Up the Path on Windows

For setting the path on Windows, we can type path %path%;C:\Python at the command prompt and then press Enter.

Running Python

We can start Python in any of the following three ways −

Interactive Interpreter

An operating system such as UNIX or DOS that provides a command-line interpreter or shell can be used for starting Python. We can start coding in the interactive interpreter as follows −

Step 1 − Enter python at the command line.

Step 2 − Then, we can start coding right away in the interactive interpreter.

$python             # Unix/Linux
or
python%             # Unix/Linux
or
C:> python          # Windows/DOS

Script from the Command-line

We can execute a Python script at the command line by invoking the interpreter, as follows −

$python script.py            # Unix/Linux
or
python% script.py            # Unix/Linux
or
C:> python script.py         # Windows/DOS

Integrated Development Environment

We can also run Python from a GUI environment if the system has a GUI application that supports Python. Some IDEs that support Python on various platforms are given below −

IDE for UNIX − UNIX, for Python, has the IDLE IDE.

IDE for Windows − Windows has the PythonWin IDE, which has a GUI too.

IDE for Macintosh − Macintosh has the IDLE IDE, which is downloadable as either MacBinary or BinHex'd files from the main website.
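To confirm that the installation and PATH setup described above work, you can try the script-from-the-command-line approach with a minimal, hypothetical script.py of your own (the file name and printed message below are just placeholders) −

# script.py − a minimal check that the interpreter is set up correctly
import sys

# Print the interpreter version so we know which Python is on the PATH
print("Running Python", sys.version)
print("Hello from script.py")

Saving these lines as script.py and running python script.py should print the interpreter version followed by the greeting; if the command is not found, revisit the PATH settings above.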

Python Web Scraping – Discussion

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent that can extract, parse, download and organize useful information from the web automatically. This tutorial will teach you the various concepts of web scraping and make you comfortable with scraping various types of websites and their data.

Legality of Web Scraping

With Python, we can scrape any website or particular elements of a web page, but do you have any idea whether doing so is legal or not? Before scraping any website, we must know about the legality of web scraping. This chapter explains the concepts related to the legality of web scraping.

Introduction

Generally, if you are going to use the scraped data for personal use only, then there may not be any problem. But if you are going to republish that data, then before doing so you should make a download request to the owner or do some background research about the policies of the site as well as about the data you are going to scrape.

Research Required Prior to Scraping

If you are targeting a website for scraping data from it, you need to understand its scale and structure. Following are some of the files which we need to analyze before starting web scraping.

Analyzing robots.txt

Actually most publishers allow programmers to crawl their websites to some extent. In other words, publishers want only specific portions of their websites to be crawled. To define this, websites put rules stating which portions can be crawled and which cannot. Such rules are defined in a file called robots.txt.

robots.txt is a human-readable file used to identify the portions of the website that crawlers are allowed, as well as not allowed, to scrape. There is no single enforced format of the robots.txt file, and the publishers of a website can make modifications as per their needs. We can check the robots.txt file for a particular website by appending a slash and robots.txt to the URL of that website. For example, if we want to check it for Google.com, then we need to type https://www.google.com/robots.txt and we will get something as follows −

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on……..

Some of the most common rules that are defined in a website's robots.txt file are as follows −

User-agent: BadCrawler
Disallow: /

The above rule means the robots.txt file asks a crawler with the BadCrawler user agent not to crawl the website at all.

User-agent: *
Crawl-delay: 5
Disallow: /trap

The above rule asks every crawler to wait 5 seconds between download requests to avoid overloading the server, while the /trap link tries to block malicious crawlers that follow disallowed links. Publishers can define many more rules as per their requirements.
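Rather than reading robots.txt by hand, these rules can also be checked programmatically. The following is a minimal sketch using Python's standard urllib.robotparser module; the user agent name "MyScraper" and the Google URLs are only illustrative −

import urllib.robotparser

# Download and parse the robots.txt file of the target site
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# Ask whether our (hypothetical) user agent may fetch particular paths
print(rp.can_fetch("MyScraper", "https://www.google.com/search"))        # expected False
print(rp.can_fetch("MyScraper", "https://www.google.com/search/about"))  # expected True

# crawl_delay() returns the Crawl-delay for the given agent, or None if not set
print(rp.crawl_delay("MyScraper"))

A scraper can call can_fetch() before every download and sleep for the reported crawl delay, which keeps it within the rules discussed above.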
Analyzing Sitemap Files

What are you supposed to do if you want to crawl a website for updated information? You could crawl every web page to get that updated information, but this would increase the server traffic of that particular website. That is why websites provide sitemap files to help crawlers locate updated content without needing to crawl every web page. The sitemap standard is defined at http://www.sitemaps.org/protocol.html.

Content of a Sitemap File

The following is the content of the sitemap section of https://www.microsoft.com, as discovered in its robots.txt file at https://www.microsoft.com/robots.txt −

Sitemap: https://www.microsoft.com/en-us/explore/msft_sitemap_index.xml
Sitemap: https://www.microsoft.com/learning/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/licensing/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/legal/sitemap.xml
Sitemap: https://www.microsoft.com/filedata/sitemaps/RW5xN8
Sitemap: https://www.microsoft.com/store/collections.xml
Sitemap: https://www.microsoft.com/store/productdetailpages.index.xml
Sitemap: https://www.microsoft.com/en-us/store/locations/store-locationssitemap.xml

The above content shows that the sitemap lists the URLs of the website and further allows a webmaster to specify additional information for each URL, such as its last updated date, how frequently its contents change, and its importance in relation to other URLs.
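Once a sitemap URL has been discovered in robots.txt, the pages it lists can be extracted with a few lines of Python. The sketch below simply pulls every <loc> entry out of the XML with a regular expression; the Microsoft learning sitemap is used here only as an illustrative target, and any of the sitemap URLs above could be substituted −

import re
import requests

# Download one of the sitemap files found in robots.txt (illustrative URL)
sitemap_url = "https://www.microsoft.com/learning/sitemap.xml"
xml = requests.get(sitemap_url).text

# Every page in a sitemap is wrapped in a <loc> ... </loc> element
urls = re.findall(r"<loc>(.*?)</loc>", xml)
print("Found %d URLs" % len(urls))
print(urls[:5])   # show the first few entries

Note that a sitemap index file (such as msft_sitemap_index.xml above) lists further sitemap files rather than pages, so its <loc> entries would themselves need to be downloaded and parsed in the same way.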
What is the Size of the Website?

Does the size of a website, i.e. the number of web pages it has, affect the way we crawl? Certainly yes. If we have only a small number of web pages to crawl, then efficiency is not a serious issue, but if the website has millions of web pages, for example Microsoft.com, then downloading each web page sequentially would take several months and efficiency becomes a serious concern.

Checking a Website's Size

By checking the size of the results returned by Google's crawler, we can get an estimate of the size of a website. Our result can be filtered by using the keyword site: while doing the Google search. For example, estimating the size of https://authoraditiagarwal.com/ this way returns around 60 results, which means it is not a big website and crawling it would not lead to any efficiency issue.

Which Technology is Used by the Website?

Another important question is whether the technology used by the website affects the way we crawl. Yes, it does. But how can we check which technology is used by a website? There is a Python library named builtwith with the help of which we can find out the technology used by a website.

Example

In this example we are going to check the technology used by the website https://authoraditiagarwal.com with the help of the Python library builtwith. But before using this library, we need to install it as follows −

(base) D:\ProgramData>pip install builtwith
Collecting builtwith
   Downloading https://files.pythonhosted.org/packages/9b/b8/4a320be83bb3c9c1b3ac3f9469a5d66e02918e20d226aa97a3e86bddd130/builtwith-1.3.3.tar.gz
Requirement already satisfied: six in d:\programdata\lib\site-packages (from builtwith) (1.10.0)
Building wheels for collected packages: builtwith
   Running setup.py bdist_wheel for builtwith ... done
   Stored in directory: C:\Users\gaurav\AppData\Local\pip\Cache\wheels\2b\0c\2a\96241e7fe520e75093898bf926764a924873e0304f10b2524
Successfully built builtwith
Installing collected packages: builtwith
Successfully installed builtwith-1.3.3

Now, with the help of the following simple lines of code, we can check the technology used by a particular website −

In [1]: import builtwith
In [2]: builtwith.parse('http://authoraditiagarwal.com')
Out[2]:
{'blogs': ['PHP', 'WordPress'],
 'cms': ['WordPress'],
 'ecommerce': ['WooCommerce'],
 'font-scripts': ['Font Awesome'],
 'javascript-frameworks': ['jQuery'],
 'programming-languages': ['PHP'],
 'web-servers': ['Apache']}

Who is the Owner of the Website?

The owner of the website also matters because if the owner is known for blocking crawlers, then crawlers must be careful while scraping data from the website. There is a protocol named Whois with the help of which we can find out who owns a website.

Example

In this example we are going to check the owner of a website, say microsoft.com.
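The original example stops here, so the following is only a sketch of how such a Whois lookup is commonly done from Python, assuming the third-party python-whois package (installed with pip install python-whois); the exact fields returned depend on the registry −

import whois   # provided by the python-whois package (assumed installed)

# Query the Whois record for the domain we are interested in
record = whois.whois("microsoft.com")

# The returned record behaves like a dictionary of registration details
print(record.get("registrar"))      # registrar of the domain
print(record.get("name_servers"))   # authoritative name servers
print(record.get("emails"))         # registration/abuse contact addresses

Fields such as the registrant organization and the contact e-mail addresses help in judging who operates the site and whom to approach before scraping it at scale.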

Scraping Dynamic Websites

In this chapter, let us learn how to perform web scraping on dynamic websites and the concepts involved in detail.

Introduction

Web scraping is a complex task and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature and rely on JavaScript for their functionality.

Dynamic Website Example

Let us look at an example of a dynamic website and see why it is difficult to scrape. Here we are going to take the example of searching on a website named http://example.webscraping.com/places/default/search. But how can we say that this website is dynamic in nature? It can be judged from the output of the following Python script, which tries to scrape data from the above mentioned web page −

import re
import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
re.findall('(.*?)', text)

Output

[ ]

The above output shows that the example scraper failed to extract information because the <div> element we are trying to find is empty − the search results are only filled in by JavaScript after the page loads.

Approaches for Scraping Data from Dynamic Websites

We have seen that the scraper cannot scrape the information from a dynamic website because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from dynamic, JavaScript-dependent websites −

Reverse Engineering JavaScript
Rendering JavaScript

Reverse Engineering JavaScript

The process called reverse engineering is useful here because it lets us understand how data is loaded dynamically by web pages. To do this, we need to click the inspect element tab for the specified URL. Next, we click the NETWORK tab to find all the requests made for that web page, including search.json with a path of /ajax. Instead of accessing the AJAX data from the browser or via the NETWORK tab, we can also do it with the help of the following Python script −

import requests
url = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
url.json()

Example

The above script allows us to access the JSON response by using the Python json method. Similarly, we can download the raw string response and load it by using Python's json.loads method. We are doing this with the help of the following Python script. It will basically scrape all of the countries by searching each letter of the alphabet and then iterating through the resulting pages of the JSON responses.

import requests
import string

PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/' + 'search.json?page={}&page_size={}&search_term={}'
countries = set()

for letter in string.ascii_lowercase:
   print('Searching with %s' % letter)
   page = 0
   while True:
      response = requests.get(url.format(page, PAGE_SIZE, letter))
      data = response.json()
      print('adding %d records from the page %d' % (len(data.get('records')), page))
      for record in data.get('records'):
         countries.add(record['country'])
      page += 1
      if page >= data['num_pages']:
         break

with open('countries.txt', 'w') as countries_file:
   countries_file.write('\n'.join(sorted(countries)))

After running the above script, we will get the following output and the records will be saved in the file named countries.txt.
Output

Searching with a
adding 15 records from the page 0
adding 15 records from the page 1
...

Rendering JavaScript

In the previous section, we reverse engineered how the web page's API worked and saw how to use it to retrieve the results in a single request. However, we can face the following difficulties while doing reverse engineering −

Sometimes websites can be very difficult to reverse engineer. For example, if the website is built with an advanced browser tool such as Google Web Toolkit (GWT), the resulting JavaScript code is machine-generated and difficult to understand and reverse engineer.

Some higher-level frameworks like React.js can also make reverse engineering difficult by abstracting already complex JavaScript logic.

The solution to the above difficulties is to use a browser rendering engine that parses HTML, applies the CSS formatting and executes JavaScript to display a web page.

Example

In this example, for rendering JavaScript we are going to use the familiar Python module Selenium. The following Python code will render a web page with the help of Selenium.

First, we need to import webdriver from selenium as follows −

from selenium import webdriver

Now, provide the path of the web driver which we have downloaded as per our requirement −

path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)

Now, provide the URL which we want to open in the web browser now controlled by our Python script.

driver.get('http://example.webscraping.com/search')

Now, we can use the ID of the search toolbox to set the element to select.

driver.find_element_by_id('search_term').send_keys('.')

Next, we can use JavaScript to set the select box content as follows −

js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)

The following line of code clicks search on the web page −

driver.find_element_by_id('search').click()

The next line of code makes the driver wait up to 45 seconds for the AJAX request to complete.

driver.implicitly_wait(45)

Now, for selecting the country links, we can use the CSS selector as follows −

links = driver.find_elements_by_css_selector('#results a')

Now the text of each link can be extracted for creating the list of countries −

countries = [link.text for link in links]
print(countries)
driver.close()
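The calls above follow the older Selenium 2/3 style. If your environment has a recent Selenium release (4.x), where executable_path and the find_element_by_* helpers have been removed, a roughly equivalent sketch of the same steps looks like this (driver path and selectors as above, and still only illustrative) −

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Point Selenium at the downloaded chromedriver binary (path is illustrative)
service = Service(r'C:\Users\gaurav\Desktop\Chromedriver')
driver = webdriver.Chrome(service=service)

driver.get('http://example.webscraping.com/search')

# Same interactions as before, expressed with the newer find_element(By, ...) API
driver.find_element(By.ID, 'search_term').send_keys('.')
driver.execute_script("document.getElementById('page_size').options[1].text = '100';")
driver.find_element(By.ID, 'search').click()
driver.implicitly_wait(45)

links = driver.find_elements(By.CSS_SELECTOR, '#results a')
countries = [link.text for link in links]
print(countries)
driver.close()

Either version performs the same steps; which one runs depends on the Selenium version installed in your environment.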

Python Modules for Web Scraping

In this chapter, let us learn about various Python modules that we can use for web scraping.

Python Development Environments using virtualenv

Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all the necessary executables to use the packages that our Python project requires. It also allows us to add and modify Python modules without access to the global installation.

You can use the following command to install virtualenv −

(base) D:\ProgramData>pip install virtualenv
Collecting virtualenv
   Downloading https://files.pythonhosted.org/packages/b6/30/96a02b2287098b23b875bc8c2f58071c35d2efe84f747b64d523721dc2b5/virtualenv-16.0.0-py2.py3-none-any.whl (1.9MB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 1.9MB 86kB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-16.0.0

Now, we need to create a directory which will represent the project with the help of the following command −

(base) D:\ProgramData>mkdir webscrap

Now, enter into that directory with the help of the following command −

(base) D:\ProgramData>cd webscrap

Now, we need to initialize the virtual environment folder of our choice as follows −

(base) D:\ProgramData\webscrap>virtualenv websc
Using base prefix 'd:\programdata'
New python executable in D:\ProgramData\webscrap\websc\Scripts\python.exe
Installing setuptools, pip, wheel...done.

Now, activate the virtual environment with the command given below. Once it is successfully activated, you will see its name on the left hand side in brackets.

(base) D:\ProgramData\webscrap>websc\scripts\activate

We can install any module in this environment as follows −

(websc) (base) D:\ProgramData\webscrap>pip install requests
Collecting requests
   Downloading https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl (91kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 92kB 148kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
   Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 143kB 369kB/s
Collecting certifi>=2017.4.17 (from requests)
   Downloading https://files.pythonhosted.org/packages/df/f7/04fee6ac349e915b82171f8e23cee63644d83663b34c539f7a09aed18f9e/certifi-2018.8.24-py2.py3-none-any.whl (147kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 153kB 527kB/s
Collecting urllib3<1.24,>=1.21.1 (from requests)
   Downloading https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c53851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl (133kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 143kB 517kB/s
Collecting idna<2.8,>=2.5 (from requests)
   Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 61kB 339kB/s
Installing collected packages: chardet, certifi, urllib3, idna, requests
Successfully installed certifi-2018.8.24 chardet-3.0.4 idna-2.7 requests-2.19.1 urllib3-1.23

For deactivating the virtual environment, we can use the following command −

(websc) (base) D:\ProgramData\webscrap>deactivate
(base) D:\ProgramData\webscrap>

You can see that (websc) has been deactivated.
Python Modules for Web Scraping

Web scraping is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software automatically loads and extracts data from multiple websites as per our requirement. In this section, we are going to discuss useful Python libraries for web scraping.

Requests

It is a simple Python web scraping library. It is an efficient HTTP library used for accessing web pages. With the help of Requests, we can get the raw HTML of web pages, which can then be parsed to retrieve the data. Before using requests, let us understand its installation.

Installing Requests

We can install it either in our virtual environment or in the global installation. With the help of the pip command, we can easily install it as follows −

(base) D:\ProgramData>pip install requests
Collecting requests
Using cached https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl
Requirement already satisfied: idna<2.8,>=2.5 in d:\programdata\lib\site-packages (from requests) (2.6)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in d:\programdata\lib\site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\lib\site-packages (from requests) (2018.1.18)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in d:\programdata\lib\site-packages (from requests) (3.0.4)
Installing collected packages: requests
Successfully installed requests-2.19.1

Example

In this example, we are making a GET HTTP request for a web page. For this we need to first import the requests library as follows −

In [1]: import requests

In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/ −

In [2]: r = requests.get('https://authoraditiagarwal.com/')

Now we can retrieve the content by using the .text property as follows −

In [5]: r.text[:200]

Observe that in the following output, we got the first 200 characters.

Out[5]: '<!DOCTYPE html>\n<html lang="en-US"\n\titemscope \n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#" >\n<head>\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'

Urllib3

It is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more about it in its technical documentation at https://urllib3.readthedocs.io/en/latest/.

Installing Urllib3

Using the pip command, we can install urllib3 either in our virtual environment or in the global installation.

(base) D:\ProgramData>pip install urllib3
Collecting urllib3
Using cached https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c53851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23

Example: Scraping using Urllib3 and BeautifulSoup

In the following example, we are scraping a web page by using Urllib3 and BeautifulSoup. We are using Urllib3 in place of the requests library for getting the raw data (HTML) from the web page. Then we are using BeautifulSoup for parsing that HTML data.
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')
print (soup.title)
print (soup.title.text)

This is the output you will observe when you run this code −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

Selenium

It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. We have Selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using Selenium and its Python bindings. You can learn more about Selenium with Java on the link Selenium.

Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The currently supported Python versions are 2.7, 3.5 and above.

Installing Selenium

Using the pip command, we can install Selenium either in our virtual environment or in the global installation.

pip install selenium

As Selenium requires a driver to interface with the chosen browser, we need to download it. The following table shows different browsers and the links for downloading their drivers.

Chrome    https://sites.google.com/a/chromium.org/
Edge      https://developer.microsoft.com/
Firefox   https://github.com/
Safari    https://webkit.org/

Example

This example shows web scraping using Selenium. Selenium can also be used for testing, which is called Selenium testing. After downloading the particular driver for the specified version of the browser, we need to do the programming in Python.

First, we need to import webdriver from selenium as follows −

from selenium import webdriver
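The example stops at the import, so the remaining steps are sketched below under the assumption that the Chrome driver has been downloaded to a local path (the path and target URL are placeholders, following the same style used in the chapter on dynamic websites) −

from selenium import webdriver

# Path to the downloaded chromedriver binary (adjust to your own location)
path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)

# Load a page, read some data out of the rendered document, and close the browser
driver.get('https://authoraditiagarwal.com/')
print(driver.title)             # the <title> of the rendered page
print(len(driver.page_source))  # size of the fully rendered HTML

driver.quit()

The page source obtained this way can be handed to BeautifulSoup exactly as in the Urllib3 example above, which is useful when a page builds its content with JavaScript; a fuller Selenium walkthrough is given in the chapter on scraping dynamic websites.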

Scraping Form based Websites

In the previous chapter, we saw how to scrape dynamic websites. In this chapter, let us understand the scraping of websites that work on user-based inputs, that is, form-based websites.

Introduction

These days the WWW (World Wide Web) is moving towards social media as well as user-generated content. So the question arises: how can we access information that lies beyond the login screen? For this we need to deal with forms and logins.

In previous chapters, we worked with the HTTP GET method to request information, but in this chapter we will work with the HTTP POST method, which pushes information to a web server for storage and analysis.

Interacting with Login Forms

While working on the Internet, you must have interacted with login forms many times. They may be very simple, including only a few HTML fields, a submit button and an action page, or they may be complicated and have additional fields like email and leave a message, along with a captcha for security reasons.

In this section, we are going to deal with a simple submit form with the help of the Python requests library.

First, we need to import the requests library as follows −

import requests

Now, we need to provide the information for the fields of the login form.

parameters = {'Name':'Enter your name', 'Email-id':'Your Email-id', 'Message':'Type your message here'}

In the next line of code, we need to provide the URL on which the action of the form would happen.

r = requests.post("enter the URL", data = parameters)
print(r.text)

After running the script, it will return the content of the page where the action has happened.

Suppose you want to submit an image with the form; this is very easy with requests.post(). You can understand it with the help of the following Python script −

import requests
file = {'Uploadfile': open(r'C:\Users\desktop\123.png', 'rb')}
r = requests.post("enter the URL", files = file)
print(r.text)

Loading Cookies from the Web Server

A cookie, sometimes called a web cookie or internet cookie, is a small piece of data sent from a website that our computer stores in a file located inside our web browser.

In the context of dealing with login forms, cookies can be of two types. One, which we dealt with in the previous section, allows us to submit information to a website, and the second lets us remain in a permanent "logged-in" state throughout our visit to the website. For the second kind of forms, websites use cookies to keep track of who is logged in and who is not.

What do cookies do?

These days most websites are using cookies for tracking. We can understand the working of cookies with the help of the following steps −

Step 1 − First, the site authenticates our login credentials and stores them in our browser's cookie. This cookie generally contains a server-generated token, a time-out and tracking information.

Step 2 − Next, the website uses the cookie as proof of authentication. This authentication is shown whenever we visit the website.

Cookies are very problematic for web scrapers because if web scrapers do not keep track of the cookies, the submitted form is sent back and on the next page it seems that they never logged in.
It is very easy to track cookies with the help of the Python requests library, as shown below −

import requests
parameters = {'Name':'Enter your name', 'Email-id':'Your Email-id', 'Message':'Type your message here'}
r = requests.post("enter the URL", data = parameters)

In the above line of code, the URL would be the page which acts as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

After running the above script, we will retrieve the cookies from the result of the last request.

There is another issue with cookies: sometimes websites frequently modify cookies without warning. Such a situation can be dealt with using requests.Session(), as follows −

import requests
session = requests.Session()
parameters = {'Name':'Enter your name', 'Email-id':'Your Email-id', 'Message':'Type your message here'}
r = session.post("enter the URL", data = parameters)

In the above line of code, the URL would be the page which acts as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

Observe that you can easily see the difference between the script with a session and the one without a session.

Automating Forms with Python

In this section we are going to deal with a Python module named Mechanize that will reduce our work and automate the process of filling up forms.

Mechanize Module

The Mechanize module provides us a high-level interface to interact with forms. Before starting to use it, we need to install it with the following command −

pip install mechanize

Note that it would work only in Python 2.x.

Example

In this example, we are going to automate the process of filling a login form having two fields, namely email and password −

import mechanize
brwsr = mechanize.Browser()
brwsr.open("Enter the URL of login")
brwsr.select_form(nr = 0)
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()

The above code is very easy to understand. First, we imported the mechanize module. Then a Mechanize browser object was created. Then, we navigated to the login URL and selected the form. After that, the field names and values were passed directly to the browser object, and the form was submitted.
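Because the Mechanize approach above is tied to Python 2.x, it is worth noting that a similar login flow can be sketched with the requests.Session object already used in this chapter; the URLs and field names below are placeholders that must match the actual login form −

import requests

session = requests.Session()

# Field names must match the "name" attributes of the login form's inputs (placeholders here)
credentials = {'email': 'Enter email', 'password': 'Enter password'}

# POST the credentials to the form's action URL; the session keeps the login cookies
r = session.post("enter the URL of login", data = credentials)
print(r.status_code)

# Subsequent requests through the same session are sent with the login cookies attached
profile = session.get("enter a URL that requires login")
print(profile.text[:200])

This keeps us in the "logged-in" state described earlier without any extra cookie handling, because the session object stores and resends the cookies automatically.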