Web scraping usually involves downloading, storing and processing web media content. In this chapter, let us understand how to process content downloaded from the web.
Introduction
The media content we obtain during scraping can be images, audio and video files, non-web-page documents, and data files. But can we trust the downloaded data, and in particular its file extension, before we store it on our computer? This makes it essential to know the actual type of the data we are going to store locally.
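One way to avoid relying on a file extension is to look at the first bytes of the data, since many formats begin with a fixed signature. The sketch below is only an illustration: the helper name sniff_media_type and the small signature table are assumptions made for this example, and the table covers just a handful of formats.

```python
# Hypothetical helper: guess a media type from the leading bytes of a file.
# The table is deliberately tiny; real-world data comes in many more formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_media_type(data: bytes):
    """Return a media type for a known signature, or None if unrecognized."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    return None


print(sniff_media_type(b"\x89PNG\r\n\x1a\n" + b"rest of file"))  # image/png
```

A result of None means only that the signature is not in the table, not that the data is invalid.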
Getting Media Content from Web Page
In this section, we are going to learn how to download media content and store it in a way that correctly reflects the media type reported by the web server. We can do this with the help of the Python requests module, as we did in the previous chapter.
First, we need to import necessary Python modules as follows −
import requests
Now, provide the URL of the media content we want to download and store locally.
url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
Use the following code to create HTTP response object.
r = requests.get(url)
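With the help of the following lines of code, we can save the received content to a file named ThinkBig.png.
with open("ThinkBig.png", "wb") as f:
    f.write(r.content)
After running the above Python script, we will get a file named ThinkBig.png containing the downloaded image. Note that the bytes are written unchanged, so the data stays in its original format (the URL points to a .jpg file) even though the file name ends in .png.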
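For large media files, holding the whole body in memory before writing may be undesirable. The following is a minimal sketch of writing the content piece by piece; write_chunks is a hypothetical helper name, and the requests calls in the trailing comment show one assumed way to feed it.

```python
def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


# Assumed usage with requests (not executed here):
#   r = requests.get(url, stream=True, timeout=10)
#   r.raise_for_status()  # stop early on 4xx/5xx responses
#   write_chunks(r.iter_content(chunk_size=8192), "ThinkBig.png")
```

Keeping the writer independent of requests means it can be exercised with any iterable of bytes, for example a plain list.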
Extracting Filename from URL
After downloading content from a website, we usually want to save it under the file name found in the URL. A URL can also carry additional parts, such as a query string or a fragment, so we need to extract the actual filename from the path of the URL.
With the help of following Python script, using urlparse, we can extract the filename from URL −
# urlparse is provided by the standard library module urllib.parse
from urllib.parse import urlparse
import os
url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
a.path
You can observe the output as shown below −
'/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg'
os.path.basename(a.path)
You can observe the output as shown below −
'MetaSlider_ThinkBig-1080x180.jpg'
After running the above script, we get the filename from the URL.
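URLs met while scraping often carry a query string, a fragment, or percent-encoded characters. The sketch below wraps the same urlparse approach in a small function; the name filename_from_url, the default value and the example.com URL are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download"):
    """Return the last path segment of url, ignoring any query string or fragment."""
    path = urlparse(url).path                 # ?query and #fragment are not part of .path
    name = posixpath.basename(unquote(path))  # decode %20 and similar escapes
    return name or default                    # URLs ending in "/" have no filename


print(filename_from_url("https://example.com/img/pic%20one.jpg?size=large#top"))  # pic one.jpg
```

posixpath is used instead of os.path because URL paths always use forward slashes, whatever the local operating system.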
Information about Type of Content from URL
While retrieving content from a web server with a GET request, we can also inspect the information the server provides about it. With the help of the following Python script, we can determine what the web server reports as the type of the content −
First, we need to import necessary Python modules as follows −
import requests
Now, we need to provide the URL of the media content we want to download and store locally.
url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
Following line of code will create HTTP response object.
r = requests.get(url, allow_redirects=True)
Now, we can list the kinds of information about the content that the web server provides.
for header in r.headers:
    print(header)
You can observe the output as shown below −
Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type
With the help of the following line of code, we can get a particular piece of information about the content, say content-type −
print(r.headers.get('content-type'))
You can observe the output as shown below −
image/jpeg
With the help of the following line of code, we can get another piece of information about the content, say ETag −
print(r.headers.get('ETag'))
You can observe the output as shown below −
None
Similarly, the following line of code prints the content length −
print(r.headers.get('content-length'))
You can observe the output as shown below −
12636
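The content-length value can serve as a rough check that a download is complete. Below is a minimal sketch under stated assumptions: declared_length and looks_complete are hypothetical helper names, and headers is any mapping of header names to values, such as r.headers. If the server compresses the response, requests decodes r.content, so the two sizes can legitimately differ and a mismatch is not proof of a broken download.

```python
def declared_length(headers):
    """Return the Content-Length value as an int, or None if absent or malformed."""
    for name, value in headers.items():
        if name.lower() == "content-length":
            try:
                return int(value)
            except (TypeError, ValueError):
                return None
    return None


def looks_complete(headers, body):
    """True if no length was declared, or if the body has exactly the declared size."""
    expected = declared_length(headers)
    return expected is None or expected == len(body)


print(looks_complete({"Content-Length": "3"}, b"abc"))  # True
```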
With the help of the following line of code, we can get the name of the server software, from the Server header −
print(r.headers.get('Server'))
You can observe the output as shown below −
Apache
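The content type reported by the server can be used to choose a file extension instead of trusting the one in the URL. The following sketch relies on the standard library mimetypes module; the function name extension_for and the ".bin" fallback are assumptions made for this example.

```python
import mimetypes


def extension_for(content_type, fallback=".bin"):
    """Map a Content-Type header value to a file extension such as '.png'."""
    if not content_type:
        return fallback
    # Drop parameters such as "; charset=utf-8" and normalize the case
    media_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type) or fallback


print(extension_for("image/png"))  # .png
```

With the response from this section, a call such as extension_for(r.headers.get('content-type')) would be the natural use. The extension returned for some types, image/jpeg among them, can vary between Python versions and platforms.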
Generating Thumbnail for Images
A thumbnail is a small, reduced-size representation of an image. A user may want to save only the thumbnail of a large image, or both the image and its thumbnail. In this section, we are going to create a thumbnail of the image named ThinkBig.png, downloaded in the previous section “Getting Media Content from Web Page”.
For this Python script, we need to install the Python library named Pillow, a fork of the Python Imaging Library (PIL) with useful functions for manipulating images. It can be installed with the help of the following command −
pip install pillow
The following Python script will create a thumbnail of the image and save it to the current directory, prefixing the thumbnail file name with Th_ −
import glob
from PIL import Image

for infile in glob.glob("ThinkBig.png"):
    # Skip files that are already thumbnails
    if infile.startswith("Th_"):
        continue
    img = Image.open(infile)
    # Shrink in place to fit within 128x128 while keeping the aspect ratio.
    # Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10.
    img.thumbnail((128, 128), Image.LANCZOS)
    img.save("Th_" + infile, "png")
After running the script, you can check the current directory for the thumbnail file Th_ThinkBig.png.
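The thumbnail call keeps the aspect ratio, so the result is usually not 128 pixels on both sides. The sketch below approximates the size calculation with plain arithmetic; fit_within is a hypothetical helper, Pillow's own rounding may differ slightly, and the 1080x180 input size is assumed from the image's file name.

```python
def fit_within(size, box):
    """Scale (width, height) down to fit inside box while keeping the aspect ratio."""
    width, height = size
    max_w, max_h = box
    scale = min(max_w / width, max_h / height, 1.0)  # never enlarge the image
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within((1080, 180), (128, 128)))  # (128, 21)
```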
Screenshot from Website
In web scraping, a common task is to take a screenshot of a website. To implement this, we are going to use Selenium and its webdriver module. The following Python script will take a screenshot of a website and save it to the current directory.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

path = r'C:\Users\gaurav\Desktop\Chromedriver'
# Selenium 4 takes the driver location through a Service object;
# older releases used webdriver.Chrome(executable_path=path) instead.
browser = webdriver.Chrome(service=Service(path))
browser.get('https://tutorialspoint.com/')
screenshot = browser.save_screenshot('screenshot.png')  # True on success
browser.quit()  # the parentheses are required, otherwise the browser stays open
You can observe the output as shown below −
DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-a571-892dc4c90eb7
After running the script, you can check your current directory for screenshot.png file.
Thumbnail Generation for Video
Suppose we have downloaded videos from a website and want to generate thumbnails for them, so that a specific video can be chosen based on its thumbnail. To generate thumbnails for videos, we need a tool called ffmpeg, which can be downloaded from www.ffmpeg.org. After downloading, we need to install it according to the instructions for our OS.
The following Python script will generate thumbnail of the video and will save it to our local directory −
import subprocess

video_MP4_file = r"C:\Users\gaurav\desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'
# -y overwrites an existing output file, -ss seeks to 20 seconds,
# and -vframes 1 writes a single frame
subprocess.call(['ffmpeg', '-y', '-i', video_MP4_file, '-ss', '00:00:20.000', '-vframes', '1', thumbnail_image_file])
After running the above script, we will get the thumbnail named thumbnail_solar_video.jpg saved in our local directory.
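When thumbnails are needed for several videos, it can help to build the ffmpeg argument list in one place and to check that ffmpeg is installed before calling it. The sketch below uses the same options as the script above; the function names thumbnail_command and ffmpeg_available are assumptions made for illustration.

```python
import shutil


def thumbnail_command(video, image, timestamp="00:00:20.000"):
    """Build the ffmpeg argument list that saves one frame of video at timestamp."""
    return ["ffmpeg", "-y", "-i", str(video), "-ss", timestamp,
            "-vframes", "1", str(image)]


def ffmpeg_available():
    """True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


# Assumed usage (not executed here):
#   import subprocess
#   if ffmpeg_available():
#       subprocess.run(thumbnail_command("solar.mp4", "thumbnail_solar_video.jpg"), check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.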
Ripping an MP4 video to an MP3
Suppose you have downloaded a video file from a website but need only its audio. This can be done in Python with the help of a library called moviepy, which can be installed with the following command −
pip install moviepy
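Now, after successfully installing moviepy, we can convert an MP4 to MP3 with the help of the following script.
# Works with MoviePy 1.x; MoviePy 2.x removed moviepy.editor,
# so there use: from moviepy import VideoFileClip
import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
clip.audio.write_audiofile("movie_audio.mp3")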
You can observe the output as shown below −
[MoviePy] Writing audio in movie_audio.mp3
100%|██████████| 674/674 [00:01<00:00, 476.30it/s]
[MoviePy] Done.
The above script will save the audio MP3 file in the local directory.
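As an alternative to moviepy, the ffmpeg tool installed in the previous section can extract the audio track directly. The following is a sketch of that different technique, not of the moviepy approach: audio_command is a hypothetical helper, and it assumes the local ffmpeg build includes an MP3 encoder.

```python
from pathlib import Path


def audio_command(video, audio=None):
    """Build the ffmpeg argument list that writes only the audio track of video."""
    if audio is None:
        # Default: same name as the video, with an .mp3 extension
        audio = Path(video).with_suffix(".mp3")
    # -vn drops the video stream; ffmpeg infers the output format from the extension
    return ["ffmpeg", "-y", "-i", str(video), "-vn", str(audio)]


print(audio_command("1234.mp4"))
# ['ffmpeg', '-y', '-i', '1234.mp4', '-vn', '1234.mp3']
```

The list can then be passed to subprocess.run in the same way as the thumbnail command.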