Scrapy – Extracting Items

Description

For extracting data from web pages, Scrapy uses selectors based on XPath and CSS expressions. Following are some examples of XPath expressions −

/html/head/title − This will select the <title> element inside the <head> element of an HTML document.

/html/head/title/text() − This will select the text within the same <title> element.

//td − This will select all the <td> elements.

//div[@class = "slice"] − This will select all <div> elements that contain the attribute class = "slice".

Selectors have four basic methods as shown in the following table −

Sr.No   Method & Description

1   extract()
    It returns a unicode string along with the selected data.

2   re()
    It returns a list of unicode strings, extracted by applying the regular expression given as an argument.

3   xpath()
    It returns a list of selectors, which represent the nodes selected by the XPath expression given as an argument.

4   css()
    It returns a list of selectors, which represent the nodes selected by the CSS expression given as an argument.

Using Selectors in the Shell

To demonstrate the selectors with the built-in Scrapy shell, you need to have IPython installed on your system. The important thing here is that the URL should be included within quotes while running Scrapy; otherwise URLs with '&' characters won't work.

You can start a shell by using the following command in the project's top level directory −

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

A shell will look like the following −

[ ... Scrapy log here ... ]
2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

When the shell loads, you can access the body or headers by using response.body and response.headers respectively. Similarly, you can run queries on the response using response.selector.xpath() or response.selector.css(). For instance −

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath = '//title' data = u'<title>My Book – Scrapy'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>My Book – Scrapy: Index: Chapters</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath = '//title/text()' data = u'My Book – Scrapy: Index:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'My Book – Scrapy: Index: Chapters']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Scrapy', u'Index']

Extracting the Data

To extract data from a normal HTML site, we have to inspect the source code of the site to get XPaths. After inspecting, you can see that the data will be in the ul tag. Select the elements within the li tag.
The following lines of code show extraction of different types of data −

For selecting data within the li tag −

response.xpath('//ul/li')

For selecting descriptions −

response.xpath('//ul/li/text()').extract()

For selecting site titles −

response.xpath('//ul/li/a/text()').extract()

For selecting site links −

response.xpath('//ul/li/a/@href').extract()

The following code demonstrates the use of the above extractors −

import scrapy

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]
   
   def parse(self, response):
      for sel in response.xpath('//ul/li'):
         title = sel.xpath('a/text()').extract()
         link = sel.xpath('a/@href').extract()
         desc = sel.xpath('text()').extract()
         print(title, link, desc)
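The same data can also be selected with the css() method described earlier in this chapter. The following is a brief sketch of roughly equivalent CSS expressions, assuming the same ul/li/a structure as the XPath version −

response.css('ul li')                            # data within the li tag
response.css('ul li::text').extract()            # descriptions
response.css('ul li a::text').extract()          # site titles
response.css('ul li a::attr(href)').extract()    # site links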

Scrapy – Using an Item

Description

Item objects work like regular Python dicts. We can use the following syntax to access the attributes of the class −

>>> item = DmozItem()
>>> item['title'] = 'sample title'
>>> item['title']
'sample title'

Add the above code to the following example −

import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]
   
   def parse(self, response):
      for sel in response.xpath('//ul/li'):
         item = DmozItem()
         item['title'] = sel.xpath('a/text()').extract()
         item['link'] = sel.xpath('a/@href').extract()
         item['desc'] = sel.xpath('text()').extract()
         yield item

The output of the above spider will be −

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
   {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
    'link': [u'http://gnosis.cx/TPiP/'],
    'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
   {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
    'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
    'title': [u'XML Processing with Python']}
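The example above imports DmozItem from tutorial.items, but the item class itself is not shown in this chapter. The following is a minimal sketch of what such an items.py could look like, declaring the three fields the spider populates −

import scrapy

class DmozItem(scrapy.Item):
   # Fields filled in by the spider above
   title = scrapy.Field()
   link = scrapy.Field()
   desc = scrapy.Field()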

Scrapy – Selectors

Description

When you are scraping web pages, you need to extract a certain part of the HTML source by using a mechanism called selectors, which use either XPath or CSS expressions. Selectors are built upon the lxml library, which processes XML and HTML in Python.

Use the following code snippet to define the different concepts of selectors −

<html>
   <head>
      <title>My Website</title>
   </head>
   
   <body>
      <span>Hello world!!!</span>
      <div class = "links">
         <a href = "one.html">Link 1<img src = "image1.jpg"/></a>
         <a href = "two.html">Link 2<img src = "image2.jpg"/></a>
         <a href = "three.html">Link 3<img src = "image3.jpg"/></a>
      </div>
   </body>
</html>

Constructing Selectors

You can construct selector class instances by passing a text or TextResponse object. Based on the provided input type, the selector chooses the following rules −

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

Using the above code, you can construct from the text as −

Selector(text = body).xpath('//span/text()').extract()

It will display the result as −

[u'Hello world!!!']

You can construct from the response as −

response = HtmlResponse(url = 'http://mysite.com', body = body)
Selector(response = response).xpath('//span/text()').extract()

It will display the result as −

[u'Hello world!!!']

Using Selectors

Using the above simple code snippet, you can construct the XPath for selecting the text which is defined in the title tag as shown below −

>>response.selector.xpath('//title/text()')

Now, you can extract the textual data using the .extract() method shown as follows −

>>response.xpath('//title/text()').extract()

It will produce the result as −

[u'My Website']

You can display the text of all the link elements as follows −

>>response.xpath('//div[@class = "links"]/a/text()').extract()

It will display the elements as −

Link 1
Link 2
Link 3

If you want to extract the first element, then use the method .extract_first(), shown as follows −

>>response.xpath('//div[@class = "links"]/a/text()').extract_first()

It will display the element as −

Link 1

Nesting Selectors

Using the above code, you can nest the selectors to display the page link and image source using the .xpath() method, shown as follows −

links = response.xpath('//a[contains(@href, "image")]')

for index, link in enumerate(links):
   args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
   print('The link %d pointing to url %s and image %s' % args)

It will display the result as −

The link 0 pointing to url [u'one.html'] and image [u'image1.jpg']
The link 1 pointing to url [u'two.html'] and image [u'image2.jpg']
The link 2 pointing to url [u'three.html'] and image [u'image3.jpg']

Selectors Using Regular Expressions

Scrapy allows extracting the data using regular expressions, via the .re() method. From the above HTML code, we will extract the link text as follows −

>>response.xpath('//a[contains(@href, "image")]/text()').re(r'(Link\s*\d)')

The above line displays the link text as −

[u'Link 1', u'Link 2', u'Link 3']

Using Relative XPaths

When you are working with XPaths that start with /, nested selectors and XPaths are relative to the absolute path of the document, and not the relative path of the selector.
If you want to extract the <p> elements, then first get all the div elements −

>>mydiv = response.xpath('//div')

Next, you can extract all the 'p' elements inside by prefixing the XPath with a dot as .//p, as shown below −

>>for p in mydiv.xpath('.//p').extract():
      print(p)

Using EXSLT Extensions

EXSLT is a community initiative that provides extensions to XSLT (Extensible Stylesheet Language Transformations), which transforms XML documents. You can use the EXSLT extensions with their registered namespaces in XPath expressions, as shown in the following table −

Sr.No   Prefix & Usage            Namespace

1       re
        regular expressions       http://exslt.org/regexp/index.html

2       set
        set manipulation          http://exslt.org/set/index.html

You can check the simple code format for extracting data using regular expressions in the previous section.

There are some XPath tips which are useful when using XPath with Scrapy selectors; refer to the Scrapy documentation for more information.
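As a brief, hedged illustration of the re prefix listed above, applied to the sample HTML shown at the top of this chapter, the following expression keeps only those links whose href matches a regular expression; the exact output depends on the page being scraped −

>>response.xpath(r'//a[re:test(@href, "\w+\.html$")]/@href').extract()

For the sample HTML, this would be expected to return −

[u'one.html', u'two.html', u'three.html']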

Scrapy – Create a Project

Description

To scrape data from web pages, you first need to create the Scrapy project where you will be storing the code. To create a new project, run the following command −

scrapy startproject first_scrapy

The above command will create a directory with the name first_scrapy and it will contain the following structure −

first_scrapy/
   scrapy.cfg             # deploy configuration file
   first_scrapy/          # project's Python module, you'll import your code from here
      __init__.py
      items.py            # project items file
      pipelines.py        # project pipelines file
      settings.py         # project settings file
      spiders/            # a directory where you'll later put your spiders
         __init__.py
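The spiders/ directory created above is where your spider modules will live. As a rough sketch, a minimal spider file placed under first_scrapy/spiders/ could look like the following; the spider name and URL are placeholders −

import scrapy

class FirstSpider(scrapy.Spider):
   # "name" identifies the spider when running "scrapy crawl first"
   name = "first"
   start_urls = ["http://www.example.com"]   # placeholder URL
   
   def parse(self, response):
      # Extraction logic goes here; see the chapters on selectors and items.
      pass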

Scrapy – Exceptions

Description

Irregular events are referred to as exceptions. In Scrapy, exceptions are raised due to reasons such as a missing configuration, dropping an item from the item pipeline, and so on. Following is the list of exceptions mentioned in Scrapy and their applications.

DropItem

The item pipeline utilizes this exception to stop processing an item at any stage. It can be written as −

exception scrapy.exceptions.DropItem

CloseSpider

This exception is raised from a spider callback to request that the spider be closed. It can be written as −

exception scrapy.exceptions.CloseSpider(reason = 'cancelled')

It contains a parameter called reason (str) which specifies the reason for closing. For instance, the following code shows this exception's usage −

def parse_page(self, response):
   if 'Bandwidth exceeded' in response.body:
      raise CloseSpider('bandwidth_exceeded')

IgnoreRequest

This exception is used by the scheduler or downloader middleware to ignore a request. It can be written as −

exception scrapy.exceptions.IgnoreRequest

NotConfigured

It indicates a missing configuration situation and should be raised in a component constructor.

exception scrapy.exceptions.NotConfigured

This exception can be raised by any of the following components to disable themselves −

Extensions
Item pipelines
Downloader middlewares
Spider middlewares

NotSupported

This exception is raised when a feature or method is not supported. It can be written as −

exception scrapy.exceptions.NotSupported
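As a hedged sketch of how NotConfigured is typically raised, the following hypothetical pipeline disables itself when a required setting is missing; the setting name and pipeline class are assumptions made for this illustration −

from scrapy.exceptions import NotConfigured

class ApiPipeline(object):
   def __init__(self, api_key):
      self.api_key = api_key
   
   @classmethod
   def from_crawler(cls, crawler):
      # APIPIPELINE_KEY is a hypothetical setting used only for this illustration.
      api_key = crawler.settings.get("APIPIPELINE_KEY")
      if not api_key:
         # Raising NotConfigured tells Scrapy to disable this component.
         raise NotConfigured("APIPIPELINE_KEY setting is missing")
      return cls(api_key)
   
   def process_item(self, item, spider):
      return item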

Scrapy – Shell

Description

The Scrapy shell can be used to scrape data with error-free code, without the use of a spider. The main purpose of the Scrapy shell is to test extraction code, i.e. XPath or CSS expressions, against the web pages you are trying to scrape.

Configuring the Shell

The shell can be configured by installing the IPython console (used for interactive computing), which is a powerful interactive shell that gives auto-completion, colorized output, etc.

If you are working on the Unix platform, then it's better to install IPython. You can also use bpython if IPython is inaccessible.

You can configure the shell by setting the environment variable called SCRAPY_PYTHON_SHELL or by defining the scrapy.cfg file as follows −

[settings]
shell = bpython

Launching the Shell

The Scrapy shell can be launched using the following command −

scrapy shell <url>

The url specifies the URL of the page to be scraped.

Using the Shell

The shell provides some additional shortcuts and Scrapy objects as described in the following tables.

Available Shortcuts

The shell provides the following shortcuts in the project −

Sr.No   Shortcut & Description

1   shelp()
    It provides the available objects and shortcuts with the help option.

2   fetch(request_or_url)
    It collects the response from the request or URL and the associated objects will get updated properly.

3   view(response)
    You can view the response for the given request in the local browser for observation. To display the external links correctly, it appends a base tag to the response body.

Available Scrapy Objects

The shell provides the following Scrapy objects in the project −

Sr.No   Object & Description

1   crawler
    It specifies the current crawler object.

2   spider
    The spider which can handle the current URL, or a default Spider object if no spider is found for the URL.

3   request
    It specifies the request object for the last collected page.

4   response
    It specifies the response object for the last collected page.

5   settings
    It provides the current Scrapy settings.

Example of Shell Session

Let us try scraping the scrapy.org site and then begin to scrape the data from reddit.com as described.
Before moving ahead, first we will launch the shell as shown in the following command −

scrapy shell 'http://scrapy.org' --nolog

Scrapy will display the available objects while using the above URL −

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated objects will get updated
[s]   view(response)    View the response for the given request

Next, begin working with the objects, shown as follows −

>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://www.reddit.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']

>> request = request.replace(method = "POST")

>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler
...

Invoking the Shell from Spiders to Inspect Responses

You can inspect the responses which are processed by the spider, when you want to examine one particular response. For instance −

import scrapy

class SpiderDemo(scrapy.Spider):
   name = "spiderdemo"
   start_urls = [
      "http://mysite.com",
      "http://mysite1.org",
      "http://mysite2.net",
   ]
   
   def parse(self, response):
      # You can inspect one specific response
      if ".net" in response.url:
         from scrapy.shell import inspect_response
         inspect_response(response, self)

As shown in the above code, you can invoke the shell from spiders to inspect the responses using the following function −

scrapy.shell.inspect_response

Now run the spider and you will get the following screen −

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   crawler
...

>> response.url
'http://mysite2.net'

You can examine whether the extraction code is working using the following line −

>> response.xpath('//div[@class = "val"]')

It displays the output as

[]

The above line has displayed only an empty output. You can also open the response in a browser to inspect it, as follows −

>> view(response)

It displays the response as

True
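As a final note on the shell, fetch() accepts not only a URL but also a full Request object, which is handy when a page needs custom headers. The following is a hedged sketch of such a session; the URL and the header value are placeholders −

>> import scrapy
>> req = scrapy.Request("http://mysite.com", headers = {"User-Agent": "my-test-agent"})
>> fetch(req)        # downloads the request and updates the shell objects
>> response.status
200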

Scrapy – Item Loaders

Description

Item loaders provide a convenient way to fill the items that are scraped from websites.

Declaring Item Loaders

Item Loaders are declared like Items. For example −

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class DemoLoader(ItemLoader):
   default_output_processor = TakeFirst()
   
   title_in = MapCompose(unicode.title)
   title_out = Join()
   
   size_in = MapCompose(unicode.strip)
   
   # you can continue scraping here

In the above code, you can see that input processors are declared using the _in suffix and output processors are declared using the _out suffix.

The ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes are used to declare default input/output processors.

Using Item Loaders to Populate Items

To use an Item Loader, first instantiate it with a dict-like object, or without one, in which case the item uses the Item class specified in the ItemLoader.default_item_class attribute.

You can use selectors to collect values into the Item Loader. You can add more values in the same item field, where the Item Loader will use an appropriate handler to add these values.

The following code demonstrates how items are populated using Item Loaders −

from scrapy.loader import ItemLoader
from demoproject.items import Product

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath("title", "//div[@class = 'product_title']")
   l.add_xpath("title", "//div[@class = 'product_name']")
   l.add_xpath("desc", "//div[@class = 'desc']")
   l.add_css("size", "div#size")
   l.add_value("last_updated", "yesterday")
   return l.load_item()

As shown above, there are two different XPaths from which the title field is extracted using the add_xpath() method −

1. //div[@class = "product_title"]
2. //div[@class = "product_name"]

Thereafter, a similar call is used for the desc field. The size data is extracted using the add_css() method and last_updated is filled with the value "yesterday" using the add_value() method.

Once all the data is collected, call the ItemLoader.load_item() method, which returns the item filled with data extracted using the add_xpath(), add_css() and add_value() methods.

Input and Output Processors

Each field of an Item Loader contains one input processor and one output processor.

When data is extracted, the input processor processes it and its result is stored in the ItemLoader. Next, after collecting the data, call the ItemLoader.load_item() method to get the populated Item object. Finally, the result of the output processor is assigned to the item.

The following code demonstrates how input and output processors are called for a specific field −

l = ItemLoader(Product(), some_selector)
l.add_xpath("title", xpath1)   # [1]
l.add_xpath("title", xpath2)   # [2]
l.add_css("title", css)        # [3]
l.add_value("title", "demo")   # [4]
return l.load_item()           # [5]

Line 1 − The data of title is extracted from xpath1 and passed through the input processor; its result is collected and stored in the ItemLoader.

Line 2 − Similarly, the title is extracted from xpath2 and passed through the same input processor; its result is added to the data collected for [1].

Line 3 − The title is extracted from the css selector and passed through the same input processor; the result is added to the data collected for [1] and [2].

Line 4 − Next, the value "demo" is assigned and passed through the input processors.
Line 5 − Finally, the data collected internally from all the fields is passed to the output processor and the final value is assigned to the Item.

Declaring Input and Output Processors

The input and output processors are declared in the ItemLoader definition. Apart from this, they can also be specified in the Item Field metadata. For example −

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_size(value):
   if value.isdigit():
      return value

class Product(scrapy.Item):
   name = scrapy.Field(
      input_processor = MapCompose(remove_tags),
      output_processor = Join(),
   )
   size = scrapy.Field(
      input_processor = MapCompose(remove_tags, filter_size),
      output_processor = TakeFirst(),
   )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item = Product())
>>> il.add_value('name', [u'Hello', u'<strong>world</strong>'])
>>> il.add_value('size', [u'<span>100 kg</span>'])
>>> il.load_item()

It displays an output as −

{'name': u'Hello world', 'size': u'100 kg'}

Item Loader Context

The Item Loader Context is a dict of arbitrary key-value pairs shared among input and output processors. For example, assume you have a function parse_length −

def parse_length(text, loader_context):
   unit = loader_context.get('unit', 'cm')
   
   # You can write parsing code of length here
   return parsed_length

By receiving the loader_context argument, the function tells the Item Loader that it can receive the Item Loader context. There are several ways to change the value of the Item Loader context −

Modify the currently active Item Loader context −

loader = ItemLoader(product)
loader.context["unit"] = "mm"

On Item Loader instantiation −

loader = ItemLoader(product, unit = "mm")

On Item Loader declaration, for input/output processors that instantiate with the Item Loader context −

class ProductLoader(ItemLoader):
   length_out = MapCompose(parse_length, unit = "mm")

ItemLoader Objects

An ItemLoader object returns a new item loader for populating the given item. It has the following class −

class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)

The following table shows the parameters of ItemLoader objects −

Sr.No   Parameter & Description

1   item
    It is the item to populate by calling add_xpath(), add_css() or add_value().

2   selector
    It is used to extract data from websites.

3   response
    It is used to construct the selector using default_selector_class.

The following table shows the methods of ItemLoader objects −

Sr.No   Method & Description & Example

1   get_value(value, *processors, **kwargs)
    For the given processors and keyword arguments, the value is processed by the get_value() method.
    
    >>> from scrapy.loader.processors import TakeFirst
    >>> loader.get_value(u'title: demoweb', TakeFirst(), unicode.upper, re = 'title: (.+)')
    u'DEMOWEB'

2   add_value(field_name, value, *processors, **kwargs)
    It processes the value and adds it to the field, where it is first passed through get_value() with the given processors and keyword arguments before passing through the field input processor.
    
    loader.add_value('title', u'DVD')
    loader.add_value('colors', [u'black', u'white'])
    loader.add_value('length', u'80')
    loader.add_value('price', u'2500')

3   replace_value(field_name, value, *processors, **kwargs)
    It replaces the collected data with a new value.
    loader.replace_value('title', u'DVD')
    loader.replace_value('colors', [u'black', u'white'])
    loader.replace_value('length', u'80')
    loader.replace_value('price', u'2500')

4   get_xpath(xpath, *processors, **kwargs)
    It is used to extract unicode strings for the given XPath, applying the given processors and keyword arguments.
    
    # HTML code: <div class = "item-name">DVD</div>
    loader.get_xpath("//div[@class = 'item-name']")
    
    # HTML code: <div id = "length">the length is 45cm</div>
    loader.get_xpath("//div[@id = 'length']", TakeFirst(), re = "the length is (.*)")

5   add_xpath(field_name, xpath, *processors, **kwargs)
    It receives an XPath for the field and adds the extracted data to that field.
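Tying the pieces of this chapter together, the following is a rough sketch of a spider callback that uses the DemoLoader declared at the top of this chapter. The Demo item, its module paths and the XPath expressions are assumptions made for the illustration; Demo is expected to define title and size fields −

from demoproject.items import Demo          # hypothetical item with title and size fields
from demoproject.loaders import DemoLoader  # hypothetical module holding the DemoLoader class

def parse(self, response):
   l = DemoLoader(item = Demo(), response = response)
   
   # The title_in/title_out and size_in processors declared on DemoLoader
   # are applied automatically when these values are added and loaded.
   l.add_xpath("title", "//h1/text()")                  # hypothetical XPath
   l.add_xpath("size", "//span[@id = 'size']/text()")   # hypothetical XPath
   return l.load_item()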

Scrapy – Link Extractors

Description

As the name itself indicates, Link Extractors are the objects that are used to extract links from web pages using scrapy.http.Response objects. In Scrapy, there are built-in extractors such as scrapy.linkextractors.LinkExtractor. You can customize your own link extractor according to your needs by implementing a simple interface.

Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects. You can instantiate a link extractor only once and call the extract_links method multiple times to extract links from different responses. The CrawlSpider class uses link extractors with a set of rules whose main purpose is to extract links.

Built-in Link Extractor's Reference

Normally link extractors are grouped with Scrapy and are provided in the scrapy.linkextractors module. By default, the link extractor will be LinkExtractor, which is equal in functionality to LxmlLinkExtractor −

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow = (), deny = (), allow_domains = (), deny_domains = (), deny_extensions = None, restrict_xpaths = (), restrict_css = (), tags = ('a', 'area'), attrs = ('href', ), canonicalize = True, unique = True, process_value = None)

The LxmlLinkExtractor is a highly recommended link extractor, because it has handy filtering options and it is used with lxml's robust HTMLParser.

Sr.No   Parameter & Description

1   allow (a regular expression (or list of))
    A single expression or group of expressions that the url must match in order to be extracted. If it is not mentioned, it will match all the links.

2   deny (a regular expression (or list of))
    A single expression or group of expressions that the url must match in order to be excluded. If it is not mentioned or left empty, it will not eliminate any links.

3   allow_domains (str or list)
    A single string or list of strings matching the domains from which the links are to be extracted.

4   deny_domains (str or list)
    A single string or list of strings matching the domains from which the links are not to be extracted.

5   deny_extensions (list)
    A list of extension strings to ignore when extracting links. If it is not set, then by default it will be set to IGNORED_EXTENSIONS, which contains a predefined list in the scrapy.linkextractors package.

6   restrict_xpaths (str or list)
    An XPath (or list of XPaths) defining regions of the response from which the links are to be extracted. If given, the links will be extracted only from the text selected by the XPath.

7   restrict_css (str or list)
    It behaves similar to the restrict_xpaths parameter, extracting the links from the CSS-selected regions inside the response.

8   tags (str or list)
    A single tag or a list of tags that should be considered when extracting the links. By default, it will be ('a', 'area').

9   attrs (list)
    A single attribute or list of attributes that should be considered while extracting links. By default, it will be ('href',).

10  canonicalize (boolean)
    The extracted url is brought to a standard form using scrapy.utils.url.canonicalize_url. By default, it will be True.

11  unique (boolean)
    It is used to filter out repeated extracted links.

12  process_value (callable)
    It is a function which receives a value from scanned tags and attributes.
The value received may be altered and returned, or None may be returned to reject the link. If not used, by default it will be lambda x: x.

Example

The following HTML code contains the link to be extracted −

<a href = "javascript:goToPage('../other/page.html'); return false">Link text</a>

The following function can be used as process_value −

def process_value(val):
   m = re.search(r"javascript:goToPage\('(.*?)'", val)
   if m:
      return m.group(1)
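Link extractors are most often combined with CrawlSpider rules, as mentioned in the description above. The following is a minimal sketch of that combination; the domain, URL patterns and callback body are placeholders −

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DemoCrawlSpider(CrawlSpider):
   name = "democrawl"
   allowed_domains = ["mysite.com"]        # placeholder domain
   start_urls = ["http://mysite.com"]
   
   rules = (
      # Follow category pages; the pattern is hypothetical.
      Rule(LinkExtractor(allow = (r"/category/", )), follow = True),
      # Parse item pages with the callback below; the pattern is hypothetical.
      Rule(LinkExtractor(allow = (r"/item/\d+", )), callback = "parse_item"),
   )
   
   def parse_item(self, response):
      # Extraction logic goes here.
      pass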

Scrapy – Item Pipeline

Description

The Item Pipeline is where scraped items are processed. After an item has been scraped by a spider, it is sent to the Item Pipeline and processed by several components, which are executed sequentially.

Whenever an item is received, the pipeline decides on one of the following actions −

Keep processing the item.
Drop it from the pipeline.
Stop processing the item.

Item pipelines are generally used for the following purposes −

Storing scraped items in a database.
Dropping repeated items if the received item is a duplicate.
Checking whether the item contains the targeted fields.
Cleansing HTML data.

Syntax

You can write an Item Pipeline using the following method −

process_item(self, item, spider)

The above method contains the following parameters −

item (item object or dictionary) − It specifies the scraped item.
spider (spider object) − The spider which scraped the item.

You can use the additional methods given in the following table −

Sr.No   Method & Description & Parameters

1   open_spider(self, spider)
    It is called when the spider is opened.
    spider (spider object) − It refers to the spider which was opened.

2   close_spider(self, spider)
    It is called when the spider is closed.
    spider (spider object) − It refers to the spider which was closed.

3   from_crawler(cls, crawler)
    With the help of the crawler, the pipeline can access the core components such as signals and settings of Scrapy.
    crawler (Crawler object) − It refers to the crawler that uses this pipeline.

Example

Following are examples of item pipelines used in different concepts.

Dropping Items with No Price

In the following code, the pipeline adjusts the price attribute for those items that do not include VAT (excludes_vat attribute) and drops those items which do not have a price −

from scrapy.exceptions import DropItem

class PricePipeline(object):
   vat = 2.25
   
   def process_item(self, item, spider):
      if item['price']:
         if item['excludes_vat']:
            item['price'] = item['price'] * self.vat
         return item
      else:
         raise DropItem("Missing price in %s" % item)

Writing Items to a JSON File

The following code will store all the scraped items from all spiders into a single items.jl file, which contains one item per line in a serialized form in JSON format. The JsonWriterPipeline class is used in the code to show how to write an item pipeline −

import json

class JsonWriterPipeline(object):
   def __init__(self):
      self.file = open('items.jl', 'wb')
   
   def process_item(self, item, spider):
      line = json.dumps(dict(item)) + "\n"
      self.file.write(line)
      return item

Writing Items to MongoDB

You can specify the MongoDB address and database name in the Scrapy settings, and the MongoDB collection can be named after the item class. The following code describes how to use the from_crawler() method to collect the resources properly −

import pymongo

class MongoPipeline(object):
   collection_name = 'Scrapy_list'
   
   def __init__(self, mongo_uri, mongo_db):
      self.mongo_uri = mongo_uri
      self.mongo_db = mongo_db
   
   @classmethod
   def from_crawler(cls, crawler):
      return cls(
         mongo_uri = crawler.settings.get('MONGO_URI'),
         mongo_db = crawler.settings.get('MONGO_DB', 'lists')
      )
   
   def open_spider(self, spider):
      self.client = pymongo.MongoClient(self.mongo_uri)
      self.db = self.client[self.mongo_db]
   
   def close_spider(self, spider):
      self.client.close()
   
   def process_item(self, item, spider):
      self.db[self.collection_name].insert(dict(item))
      return item

Duplicates Filter

A filter checks for repeated items and drops the items that have already been processed.
In the following code, we have used a unique id for our items, but the spider returns many items with the same id −

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
   def __init__(self):
      self.ids_seen = set()
   
   def process_item(self, item, spider):
      if item['id'] in self.ids_seen:
         raise DropItem("Repeated item found: %s" % item)
      else:
         self.ids_seen.add(item['id'])
         return item

Activating an Item Pipeline

You can activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting as shown in the following code. The integer values assigned to the classes determine the order in which they run (from lower valued to higher valued classes), and values should be in the 0-1000 range.

ITEM_PIPELINES = {
   'myproject.pipelines.PricePipeline': 100,
   'myproject.pipelines.JsonWriterPipeline': 600,
}
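For the MongoDB pipeline shown earlier, the from_crawler() method reads MONGO_URI and MONGO_DB from the project settings. A hedged sketch of the corresponding settings.py entries could look like the following; the connection string, database name and project path are placeholders −

# settings.py (sketch)
ITEM_PIPELINES = {
   'myproject.pipelines.MongoPipeline': 300,
}

MONGO_URI = 'mongodb://localhost:27017'   # placeholder connection string
MONGO_DB = 'scrapy_items'                 # placeholder database name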