Scrapy – Stats Collection

Description

The Stats Collector is a facility provided by Scrapy to collect stats in the form of key/value pairs. It is accessed using the Crawler API (the Crawler provides access to all Scrapy core components). The stats collector provides one stats table per spider, which is opened automatically when the spider opens and closed when the spider is closed.

Common Stats Collector Uses

The following code accesses the stats collector through the stats attribute of the crawler.

class ExtensionThatAccessStats(object):
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

The following table shows the various methods that can be used with the stats collector −

Sr.No   Method and Description

1   stats.set_value('hostname', socket.gethostname())
    Sets the stat value.

2   stats.inc_value('customized_count')
    Increments the stat value.

3   stats.max_value('max_items_scraped', value)
    Sets the stat value, only if it is greater than the previous value.

4   stats.min_value('min_free_memory_percent', value)
    Sets the stat value, only if it is lower than the previous value.

5   stats.get_value('customized_count')
    Fetches the stat value.

6   stats.get_stats()
    Fetches all the stats, e.g. {'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}

Available Stats Collectors

Scrapy provides different types of stats collectors, which can be selected using the STATS_CLASS setting.

MemoryStatsCollector

It is the default stats collector. It maintains the stats of every spider used for scraping, and the data is stored in memory.

class scrapy.statscollectors.MemoryStatsCollector

DummyStatsCollector

This stats collector is very efficient because it does nothing. It can be set using the STATS_CLASS setting and can be used to disable stats collection in order to improve performance.

class scrapy.statscollectors.DummyStatsCollector
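As a minimal sketch of how these calls fit together (the spider name, URL, and stat keys below are illustrative assumptions, not part of the Scrapy API), a spider can reach the same stats object through its crawler attribute −

import scrapy


class StatsDemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the stats API.
    name = 'stats_demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Increment a custom counter each time a page is parsed.
        self.crawler.stats.inc_value('customized_count')
        # Keep track of the largest response seen so far.
        self.crawler.stats.max_value('max_response_bytes', len(response.body))
        yield {'url': response.url}

To turn stats collection off entirely, the STATS_CLASS setting can point at the dummy collector in settings.py −

STATS_CLASS = 'scrapy.statscollectors.DummyStatsCollector'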
Scrapy – Items
Description

A Scrapy process can be used to extract data from sources such as web pages using spiders. Scrapy uses the Item class to produce the output, whose objects are used to gather the scraped data.

Declaring Items

You can declare items using the class definition syntax along with field objects, as shown below −

import scrapy

class MyProducts(scrapy.Item):
    productName = scrapy.Field()
    productLink = scrapy.Field()
    imageURL = scrapy.Field()
    price = scrapy.Field()
    size = scrapy.Field()

Item Fields

Item fields are used to declare the metadata for each field. As there is no limitation on the values accepted by the field objects, there is no reference list of the accessible metadata keys. The field objects are used to specify all the field metadata, and you can specify any other field key as per the requirements of the project. The field objects can be accessed using the Item.fields attribute.

Working with Items

There are some common functions which can be defined when you are working with items. For more information, refer to the Scrapy items documentation.

Extending Items

Items can be extended by subclassing the original item. For instance −

class MyProductDetails(Product):
    original_rate = scrapy.Field(serializer = str)
    discount_rate = scrapy.Field()

You can extend the existing field metadata by adding more values or changing the existing values, as shown in the following code −

class MyProductPackage(Product):
    name = scrapy.Field(Product.fields['name'], serializer = serializer_demo)

Item Objects

Item objects can be created using the following class, which returns a new item initialized from the given argument −

class scrapy.item.Item([arg])

Item provides a copy constructor, and the only extra attribute provided by items is fields.

Field Objects

Field objects can be specified using the following class; the Field class doesn't provide any additional processing or attributes −

class scrapy.item.Field([arg])
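As a brief sketch of how a declared item behaves in practice (the product fields and values below are made up for the example), items can be used much like dictionaries −

import scrapy


class Product(scrapy.Item):
    # Metadata for each field is passed as keyword arguments to Field().
    name = scrapy.Field()
    price = scrapy.Field(serializer=str)


product = Product(name='Desk Lamp', price=749)
product['name'] = 'LED Desk Lamp'   # assign a declared field
print(product['price'])             # read a field value
print(Product.fields)               # metadata declared for each field
# product['colour'] = 'black'       # raises KeyError: 'colour' is not a declared field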
Scrapy – Command Line Tools
Description

The Scrapy command line tool is used for controlling Scrapy, and it is often referred to as the 'Scrapy tool'. It includes the commands for various objects with a group of arguments and options.

Configuration Settings

Scrapy will find configuration settings in the scrapy.cfg file. Following are a few locations −

C:\scrapy(project folder)\scrapy.cfg in the system

~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global settings

You can find the scrapy.cfg inside the root of the project.

Scrapy can also be configured using the following environment variables −

SCRAPY_SETTINGS_MODULE
SCRAPY_PROJECT
SCRAPY_PYTHON_SHELL

Default Structure of a Scrapy Project

The following structure shows the default file structure of a Scrapy project.

scrapy.cfg                - Deploy configuration file
project_name/             - Name of the project
   __init__.py
   items.py               - It is the project's items file
   pipelines.py           - It is the project's pipelines file
   settings.py            - It is the project's settings file
   spiders/               - It is the spiders directory
      __init__.py
      spider_name.py
      . . .

The scrapy.cfg file resides in the project root directory and includes the project name along with the project settings. For instance −

[settings]
default = [name of the project].settings

[deploy]
#url = http://localhost:6800/
project = [name of the project]

Using the Scrapy Tool

The Scrapy tool provides some usage help and the available commands as follows −

Scrapy X.Y - no active project

Usage:
   scrapy <command> [options] [args]

Available commands:
   crawl    It puts the spider (which handles the URL) to work for crawling data
   fetch    It fetches the response from the given URL

Creating a Project

You can use the following command to create a project in Scrapy −

scrapy startproject project_name

This will create a project directory called project_name. Next, go to the newly created project, using the following command −

cd project_name

Controlling Projects

You can control and manage the project using the Scrapy tool, and also create a new spider, using the following command −

scrapy genspider mydomain mydomain.com

Commands such as crawl, etc. must be used inside the Scrapy project. You will come to know which commands must run inside the Scrapy project in the coming section.

Scrapy contains some built-in commands, which can be used for your project. To see the list of available commands, use the following command −

scrapy -h

When you run the above command, Scrapy will display the list of available commands as listed −

fetch − It fetches the URL using the Scrapy downloader.
runspider − It is used to run a self-contained spider without creating a project.
settings − It specifies the project setting value.
shell − It is an interactive scraping module for the given URL.
startproject − It creates a new Scrapy project.
version − It displays the Scrapy version.
view − It fetches the URL using the Scrapy downloader and shows the contents in a browser.

You can have some project related commands as listed −

crawl − It is used to crawl data using the spider.
check − It checks the items returned by the crawl command.
list − It displays the list of available spiders present in the project.
edit − You can edit the spiders using an editor.
parse − It parses the given URL with the spider.
bench − It is used to run a quick benchmark test (the benchmark tells how many pages can be crawled per minute by Scrapy).

Custom Project Commands

You can build a custom project command with the COMMANDS_MODULE setting in a Scrapy project. The setting contains an empty string by default.
You can add a custom command module as follows −

COMMANDS_MODULE = 'mycmd.commands'

Scrapy commands can also be added using the scrapy.commands section in the setup.py file, as shown below −

from setuptools import setup, find_packages

setup(
    name = 'scrapy-module_demo',
    entry_points = {
        'scrapy.commands': [
            'cmd_demo = my_module.commands:CmdDemo',
        ],
    },
)

The above code adds the cmd_demo command through the setup.py file.
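For completeness, the following is a minimal sketch of what the CmdDemo class referenced above might look like. The ScrapyCommand base class is part of Scrapy, but the module path, class name, and the setting being printed are only illustrative assumptions −

# my_module/commands.py
from scrapy.commands import ScrapyCommand


class CmdDemo(ScrapyCommand):
    # Require an active project so the command can read its settings.
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Demo command that prints the configured bot name'

    def run(self, args, opts):
        # self.settings holds the merged project settings.
        print('BOT_NAME:', self.settings.get('BOT_NAME'))

Once installed (or exposed through COMMANDS_MODULE), the command appears in the output of scrapy -h and can be run as scrapy cmd_demo from inside a project.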
Scrapy – Overview
Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath. Scrapy was first released on June 26, 2008, licensed under BSD, with the milestone 1.0 release in June 2015.

Why Use Scrapy?

It is easier to build and scale large crawling projects.
It has a built-in mechanism called Selectors for extracting data from websites (a brief selector sketch follows at the end of this overview).
It handles requests asynchronously, and it is fast.
It automatically adjusts the crawling speed using an auto-throttling mechanism.
It ensures developer accessibility.

Features of Scrapy

Scrapy is an open-source and free-to-use web crawling framework.
Scrapy generates feed exports in formats such as JSON, CSV, and XML.
Scrapy has built-in support for selecting and extracting data from sources using either XPath or CSS expressions.
Being based on a crawler, Scrapy allows extracting data from web pages automatically.

Advantages

Scrapy is easily extensible, fast, and powerful.
It is a cross-platform application framework (Windows, Linux, Mac OS and BSD).
Scrapy requests are scheduled and processed asynchronously.
Scrapy comes with a built-in service called Scrapyd, which allows you to upload projects and control spiders using a JSON web service.
It is possible to scrape any website, even if that website does not have an API for raw data access.

Disadvantages

Scrapy is only for Python 2.7+.
The installation procedure is different for different operating systems.
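As a brief illustration of the selectors mentioned above (the HTML fragment is made up for the example) −

from scrapy.selector import Selector

# A made-up HTML fragment used only to demonstrate the selector API.
html = '<div class="item"><a href="/p/1">Desk Lamp</a><span class="price">749</span></div>'
sel = Selector(text=html)

# Extract the same data with an XPath and with a CSS expression.
print(sel.xpath('//a/@href').extract_first())        # /p/1
print(sel.css('span.price::text').extract_first())   # 749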
Scrapy – Home
Scrapy Tutorial

Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath.

Audience

This tutorial is designed for software programmers who need to learn the Scrapy web crawler from scratch.

Prerequisites

You should have a basic understanding of computer programming terminology and Python. A basic understanding of XPath is a plus.