Scrapy – Crawling

Description

To execute your spider, run the following command within your first_scrapy directory −

scrapy crawl first

Here, first is the name of the spider specified while creating the spider. Once the spider crawls, you can see the following output −

2016-08-09 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2016-08-09 18:13:07-0400 [scrapy] INFO: Optional features available: …
2016-08-09 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled extensions: …
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: …
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: …
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: …
2016-08-09 18:13:07-0400 [scrapy] INFO: Spider opened
2016-08-09 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

As you can see in the output, each URL has a log line ending in (referer: None), which indicates that these are start URLs and therefore have no referrer. Next, you should see two new files named Books.html and Resources.html created in your first_scrapy directory.
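Besides the crawl command, a spider can also be run from a plain Python script. The following is a minimal sketch, assuming the firstSpider class from the First Spider chapter is importable from your project; it uses Scrapy's CrawlerProcess, which reads the project settings and runs the spider much like the command-line tool does −

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Assumes the spider class defined in first_scrapy/spiders/first_spider.py
# is importable under this path (depends on your project layout)
from first_scrapy.spiders.first_spider import firstSpider

# get_project_settings() reads settings.py of the active project
process = CrawlerProcess(get_project_settings())

# Schedule the spider and start the crawl (blocks until it finishes)
process.crawl(firstSpider)
process.start()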

Scrapy – Settings

Description

The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.

Designating the Settings

You must tell Scrapy which settings module you are using when you scrape a website. For this, use the environment variable SCRAPY_SETTINGS_MODULE; its value should be in Python path syntax.

Populating the Settings

The following mechanisms can be used to populate the settings −

1. Command line options − The arguments passed here take the highest precedence, overriding other options. The -s option is used to override one or more settings.

scrapy crawl myspider -s LOG_FILE=scrapy.log

2. Settings per-spider − Spiders can have their own settings that override the project ones, by using the custom_settings attribute.

class DemoSpider(scrapy.Spider):
   name = "demo"
   custom_settings = {
      "SOME_SETTING": "some value",
   }

3. Project settings module − Here, you can populate your custom settings, such as adding or modifying settings in the settings.py file. (A short example of this mechanism follows the built-in settings reference below.)

4. Default settings per-command − Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings.

5. Default global settings − These settings are found in the scrapy.settings.default_settings module.

Access Settings

Settings are available through self.settings and are set in the base spider after it is initialized. The following example demonstrates this.

class DemoSpider(scrapy.Spider):
   name = "demo"
   start_urls = ["http://example.com"]

   def parse(self, response):
      print("Existing settings: %s" % self.settings.attributes.keys())

To use settings before initializing the spider, you must override the from_crawler() class method in your spider. You can access the settings through the scrapy.crawler.Crawler.settings attribute passed to from_crawler(). The following example demonstrates this.

class MyExtension(object):
   def __init__(self, log_is_enabled = False):
      if log_is_enabled:
         print("Enabled log")

   @classmethod
   def from_crawler(cls, crawler):
      settings = crawler.settings
      return cls(settings.getbool("LOG_ENABLED"))

Rationale for Setting Names

Setting names are added as a prefix to the component they configure. For example, for the robots.txt extension, the setting names can be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.

Built-in Settings Reference

The following list shows the built-in settings of Scrapy −

1. AWS_ACCESS_KEY_ID − It is used to access Amazon Web Services. Default value: None

2. AWS_SECRET_ACCESS_KEY − It is used to access Amazon Web Services. Default value: None

3. BOT_NAME − The name of the bot, which can be used for constructing the User-Agent. Default value: "scrapybot"

4. CONCURRENT_ITEMS − Maximum number of items processed in parallel in the item processor. Default value: 100

5. CONCURRENT_REQUESTS − Maximum number of concurrent requests that the Scrapy downloader performs. Default value: 16

6. CONCURRENT_REQUESTS_PER_DOMAIN − Maximum number of concurrent requests performed simultaneously for any single domain. Default value: 8

7. CONCURRENT_REQUESTS_PER_IP − Maximum number of concurrent requests performed simultaneously to any single IP. Default value: 0

8. DEFAULT_ITEM_CLASS − The class used to represent items. Default value: "scrapy.item.Item"

9. DEFAULT_REQUEST_HEADERS − The default headers used for Scrapy HTTP requests.
Default value −

{
   "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
   "Accept-Language": "en",
}

10. DEPTH_LIMIT − The maximum depth a spider is allowed to crawl for any site. Default value: 0

11. DEPTH_PRIORITY − An integer used to alter the priority of a request according to its depth. Default value: 0

12. DEPTH_STATS − It states whether to collect depth stats or not. Default value: True

13. DEPTH_STATS_VERBOSE − When enabled, the number of requests is collected in stats for each verbose depth. Default value: False

14. DNSCACHE_ENABLED − It is used to enable the in-memory DNS cache. Default value: True

15. DNSCACHE_SIZE − It defines the size of the in-memory DNS cache. Default value: 10000

16. DNS_TIMEOUT − The timeout for DNS to process the queries. Default value: 60

17. DOWNLOADER − The downloader used for the crawling process. Default value: "scrapy.core.downloader.Downloader"

18. DOWNLOADER_MIDDLEWARES − A dictionary holding downloader middlewares and their orders. Default value: {}

19. DOWNLOADER_MIDDLEWARES_BASE − A dictionary holding the downloader middlewares that are enabled by default. Default value −

{
   "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
}

20. DOWNLOADER_STATS − This setting is used to enable the downloader stats. Default value: True

21. DOWNLOAD_DELAY − The time the downloader waits before downloading consecutive pages from the same site. Default value: 0

22. DOWNLOAD_HANDLERS − A dictionary with download handlers. Default value: {}

23. DOWNLOAD_HANDLERS_BASE − A dictionary with the download handlers that are enabled by default. Default value −

{
   "file": "scrapy.core.downloader.handlers.file.FileDownloadHandler",
}

24. DOWNLOAD_TIMEOUT − The total time the downloader waits before it times out. Default value: 180

25. DOWNLOAD_MAXSIZE − The maximum size of response that the downloader will download. Default value: 1073741824 (1024 MB)

26. DOWNLOAD_WARNSIZE − The response size at which the downloader starts to warn. Default value: 33554432 (32 MB)

27. DUPEFILTER_CLASS − The class used for detecting and filtering duplicate requests. Default value: "scrapy.dupefilters.RFPDupeFilter"

28. DUPEFILTER_DEBUG − This setting logs all duplicate filters when set to True. Default value: False

29. EDITOR − The editor used to edit spiders with the edit command. Default value: depends on the environment

30. EXTENSIONS − A dictionary with the extensions that are enabled in the project. Default value: {}

31. EXTENSIONS_BASE − A dictionary with the built-in extensions. Default value −

{
   "scrapy.extensions.corestats.CoreStats": 0,
}

32. FEED_TEMPDIR − A directory used to set the custom folder where crawler temporary files can be stored.

33. ITEM_PIPELINES − A dictionary with the item pipelines. Default value: {}

34. LOG_ENABLED − It defines whether logging is enabled. Default value: True

35. LOG_ENCODING − It defines the encoding used for logging. Default value: "utf-8"

36. LOG_FILE − The name of the file to be used for the logging output. Default value: None

37. LOG_FORMAT − The string used to format log messages. Default value: "%(asctime)s [%(name)s] %(levelname)s: %(message)s"

38. LOG_DATEFORMAT − The string used to format date/time in log messages. Default value: "%Y-%m-%d %H:%M:%S"

39. LOG_LEVEL − It defines the minimum log level. Default value: "DEBUG"

40. LOG_STDOUT − If set to True, all standard output of the process is redirected to the log. Default value: False
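As a quick illustration of the project settings module mechanism mentioned earlier, a few of the settings listed above could be tuned in the project's settings.py. This is only a sketch; the values shown are arbitrary examples, not recommended defaults −

# first_scrapy/settings.py (excerpt) -- example values only
BOT_NAME = "first_scrapy"

# Throttle the crawl: wait 2 seconds between requests to the same site
DOWNLOAD_DELAY = 2

# Limit crawl depth and overall concurrency
DEPTH_LIMIT = 3
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Log to a file at INFO level instead of DEBUG on stdout
LOG_FILE = "scrapy.log"
LOG_LEVEL = "INFO"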

Scrapy – Spiders

Description

A spider is a class responsible for defining how to follow the links through a website and extract the information from its pages.

The default spiders of Scrapy are as follows −

scrapy.Spider

It is a spider from which every other spider must inherit. It has the following class −

class scrapy.spiders.Spider

The following list shows the fields of the scrapy.Spider class −

1. name − The name of your spider.

2. allowed_domains − A list of domains on which the spider is allowed to crawl.

3. start_urls − A list of URLs from where the spider begins to crawl; they are the roots for later crawls.

4. custom_settings − Settings that override the project-wide configuration when running this spider.

5. crawler − An attribute that links to the Crawler object to which the spider instance is bound.

6. settings − The settings for running the spider.

7. logger − A Python logger used to send log messages.

8. from_crawler(crawler, *args, **kwargs) − A class method that creates your spider. Its parameters are −

crawler − A crawler to which the spider instance will be bound.

args (list) − These arguments are passed to the method __init__().

kwargs (dict) − These keyword arguments are passed to the method __init__().

9. start_requests() − When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method.

10. make_requests_from_url(url) − A method used to convert URLs to requests.

11. parse(response) − This method processes the response and returns scraped data, following more URLs if needed.

12. log(message[, level, component]) − A method that sends a log message through the spider's logger.

13. closed(reason) − This method is called when the spider closes.

Spider Arguments

Spider arguments are used to specify start URLs and are passed using the crawl command with the -a option, shown as follows −

scrapy crawl first_scrapy -a group=accessories

The following code demonstrates how a spider receives arguments −

import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def __init__(self, group = None, *args, **kwargs):
      super(FirstSpider, self).__init__(*args, **kwargs)
      self.start_urls = ["http://www.example.com/group/%s" % group]

Generic Spiders

You can use generic spiders to subclass your spiders from. Their aim is to follow all links on the website based on certain rules to extract data from all pages.

For the examples used in the following spiders, let's assume we have a project with the following fields −

import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
   product_title = Field()
   product_link = Field()
   product_description = Field()

CrawlSpider

CrawlSpider defines a set of rules to follow the links and scrape more than one page. It has the following class −

class scrapy.spiders.CrawlSpider

Following are the attributes of the CrawlSpider class −

rules − A list of rule objects that defines how the crawler follows the links.

The following list shows the arguments of a rule in the CrawlSpider class −

1. LinkExtractor − It specifies how the spider follows the links and extracts the data.

2. callback − It is called after each page is scraped.

3. follow − It specifies whether to continue following links or not.

parse_start_url(response) − It returns either an item or a request object, allowing you to parse the initial responses.
Note − Make sure you name your callback something other than parse while writing the rules, because the parse method is used by CrawlSpider itself to implement its logic.

Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing them with the parse_item method −

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      Rule(LinkExtractor(allow = (), restrict_xpaths = ("//div[@class='next']",)),
         callback = "parse_item", follow = True),
   )

   def parse_item(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("a/text()").extract()
      item["product_link"] = response.xpath("a/@href").extract()
      item["product_description"] = response.xpath("div[@class='desc']/text()").extract()
      return item

XMLFeedSpider

It is the base class for spiders that scrape from XML feeds, iterating over their nodes. It has the following class −

class scrapy.spiders.XMLFeedSpider

The following list shows the class attributes used to set an iterator and a tag name −

1. iterator − It defines the iterator to be used. It can be iternodes, html or xml. Default is iternodes.

2. itertag − A string with the node name to iterate over.

3. namespaces − A list of (prefix, uri) tuples that automatically registers namespaces using the register_namespace() method.

4. adapt_response(response) − It receives the response and modifies the response body as soon as it arrives from the spider middleware, before the spider starts parsing it.

5. parse_node(response, selector) − It receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method.

6. process_results(response, results) − It returns a list of the results and the response returned by the spider.

CSVFeedSpider

It receives a CSV file as a response, iterates through each of its rows, and calls the parse_row() method. It has the following class −

class scrapy.spiders.CSVFeedSpider

The following list shows the options that can be set regarding the CSV file −

1. delimiter − A string containing the separator for each field, such as a comma (",").

2. quotechar − A string containing the quotation character for each field, such as '"'.

3. headers − A list of the headers from which the fields can be extracted.

4. parse_row(response, row) − It receives a response and each row, along with a key for each header.

CSVFeedSpider Example

from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.csv"]
   delimiter = ";"
   quotechar = "'"
   headers = ["product_title", "product_link", "product_description"]

   def parse_row(self, response, row):
      self.logger.info("This is row: %r", row)
      item = DemoItem()
      item["product_title"] = row["product_title"]
      item["product_link"] = row["product_link"]
      item["product_description"] = row["product_description"]
      return item

SitemapSpider

SitemapSpider crawls a website with the help of Sitemaps, locating the sitemap URLs from robots.txt. It has the following class −

class scrapy.spiders.SitemapSpider

The following list shows the main attributes of the SitemapSpider class −

1. sitemap_urls − A list of URLs pointing to the sitemaps (or robots.txt files) you want to crawl.

2. sitemap_rules − A list of (regex, callback) tuples; URLs matching the regular expression are processed with the given callback.

3. sitemap_follow − A list of regular expressions of sitemap URLs that should be followed from sitemap index files.

4. sitemap_alternate_links − It specifies whether alternate links for a URL should be followed. Default value: False
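The following is a minimal SitemapSpider sketch to round off this chapter. It assumes the same hypothetical demoexample.com domain and DemoItem class used in the examples above; the sitemap URL, the /product/ pattern and the selector expressions are illustrative only −

from scrapy.spiders import SitemapSpider
from demoproject.items import DemoItem

class DemoSitemapSpider(SitemapSpider):
   name = "demo_sitemap"
   # Scrapy reads this sitemap (or robots.txt) and extracts URLs from it
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]
   # Only URLs matching /product/ are sent to parse_product; the rest are ignored
   sitemap_rules = [("/product/", "parse_product")]

   def parse_product(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("//h1/text()").extract()
      item["product_link"] = [response.url]
      item["product_description"] = response.xpath("//div[@class='desc']/text()").extract()
      return item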

Scrapy – Web Services

Description

A running Scrapy web crawler can be controlled via JSON-RPC. It is enabled by the JSONRPC_ENABLED setting. This service provides access to the main crawler object via the JSON-RPC 2.0 protocol. The endpoint for accessing the crawler object is −

http://localhost:6080/crawler

The following list contains some of the settings which control the behavior of the web service −

1. JSONRPC_ENABLED − A boolean which decides whether the web service (along with its extension) is enabled or not. Default value: True

2. JSONRPC_LOGFILE − The file used for logging HTTP requests made to the web service. If it is not set, the standard Scrapy log is used. Default value: None

3. JSONRPC_PORT − The port range for the web service. If it is set to None, the port is dynamically assigned. Default value: [6080, 7030]

4. JSONRPC_HOST − The interface the web service should listen on. Default value: '127.0.0.1'
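Assuming the web-service extension described above is available in your project, the settings it documents would be placed in settings.py like any other Scrapy setting. The values below are only illustrative −

# settings.py (excerpt) -- hypothetical values for the JSON-RPC web service
JSONRPC_ENABLED = True
JSONRPC_HOST = "127.0.0.1"
JSONRPC_PORT = [6080, 7030]
JSONRPC_LOGFILE = "webservice.log"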

Scrapy – Discussion

Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath.

Scrapy – Sending an E-mail

Description

Scrapy can send e-mails using its own facility, which is based on Twisted non-blocking IO and stays out of the way of the crawler's non-blocking IO. You can configure a few settings for sending e-mails, and it provides a simple API for sending attachments as well.

There are two ways to instantiate the MailSender −

1. By using the standard constructor −

from scrapy.mail import MailSender
mailer = MailSender()

2. By using the Scrapy settings object −

mailer = MailSender.from_settings(settings)

The following line sends an e-mail without attachments −

mailer.send(to = ["[email protected]"], subject = "subject data", body = "body data",
   cc = ["[email protected]"])

MailSender Class Reference

The MailSender class uses Twisted non-blocking IO for sending e-mails from Scrapy.

class scrapy.mail.MailSender(smtphost = None, mailfrom = None, smtpuser = None,
   smtppass = None, smtpport = None)

The following list shows the parameters used in the MailSender class −

1. smtphost (str) − The SMTP host used for sending the e-mails. If not given, the MAIL_HOST setting is used.

2. mailfrom (str) − The address used as the sender of the e-mails. If not given, the MAIL_FROM setting is used.

3. smtpuser − It specifies the SMTP user. If not given, the MAIL_USER setting is used, and there is no SMTP validation if that is not set either.

4. smtppass (str) − It specifies the SMTP password for validation.

5. smtpport (int) − It specifies the SMTP port for the connection.

6. smtptls (boolean) − It enforces using SMTP STARTTLS.

7. smtpssl (boolean) − It enforces using a secure SSL connection.

The following two methods are part of the MailSender class reference.

First method −

classmethod from_settings(settings)

It instantiates the mailer using a Scrapy settings object. It takes the following parameter −

settings (scrapy.settings.Settings object) − The settings used to configure the e-mail sender.

Second method −

send(to, subject, body, cc = None, attachs = (), mimetype = 'text/plain', charset = None)

The following list contains the parameters of the above method −

1. to (list) − It refers to the e-mail receivers.

2. subject (str) − It specifies the subject of the e-mail.

3. cc (list) − It refers to the list of additional receivers.

4. body (str) − It refers to the e-mail body data.

5. attachs (iterable) − It refers to the e-mail's attachments: the name of the attachment, the mimetype of the attachment, and the contents of the attachment.

6. mimetype (str) − It represents the MIME type of the e-mail.

7. charset (str) − It specifies the character encoding used for the e-mail contents.

Mail Settings

The following settings ensure that, without writing any code, we can configure an e-mail using the MailSender class in the project.

1. MAIL_FROM − The sender address used for sending e-mails. Default value: 'scrapy@localhost'

2. MAIL_HOST − The SMTP host used for sending e-mails. Default value: 'localhost'

3. MAIL_PORT − The SMTP port to be used for sending e-mails. Default value: 25

4. MAIL_USER − The user for SMTP validation. There is no validation if this setting is not set. Default value: None

5. MAIL_PASS − The password used for SMTP validation. Default value: None

6. MAIL_TLS − It upgrades an insecure connection to a secure connection using SSL/TLS. Default value: False

7. MAIL_SSL − It enforces the connection using an SSL encrypted connection. Default value: False
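The following is a minimal sketch of sending an e-mail with an attachment, for example from an extension or pipeline. It assumes the mail settings above are configured in your project's settings.py; the file name and recipient address are placeholders. Each entry in attachs is a (name, mimetype, file object) tuple −

from scrapy.mail import MailSender
from scrapy.utils.project import get_project_settings

# Read the project settings (MAIL_HOST, MAIL_FROM, ...) and build the mailer
settings = get_project_settings()
mailer = MailSender.from_settings(settings)

# Hypothetical report file to attach
report = open("report.csv", "rb")

mailer.send(
   to = ["[email protected]"],
   subject = "Crawl report",
   body = "The crawl has finished; the report is attached.",
   attachs = [("report.csv", "text/csv", report)],
)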

Scrapy – Define an Item

Description

Items are the containers used to collect the data that is scraped from the websites. You must start your spider by defining your Item. To define items, edit the items.py file found under the first_scrapy directory (custom directory). The items.py looks like the following −

import scrapy

class First_scrapyItem(scrapy.Item):
   # define the fields for your item here like:
   # name = scrapy.Field()
   pass

The First_scrapyItem class inherits from Item, which comes with a number of pre-defined objects that Scrapy has already built for us. For instance, if you want to extract the name, URL, and description from the sites, you need to define a field for each of these three attributes. Hence, let's add the items that we want to collect −

import scrapy

class First_scrapyItem(scrapy.Item):
   name = scrapy.Field()
   url = scrapy.Field()
   desc = scrapy.Field()
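Once defined, an item behaves much like a Python dictionary. The following sketch shows how a spider callback could create a First_scrapyItem, fill its fields, and yield it; the spider name, start URL and selector expressions are placeholders for illustration only −

import scrapy
from first_scrapy.items import First_scrapyItem

class ItemDemoSpider(scrapy.Spider):
   name = "item_demo"
   start_urls = ["http://www.example.com"]

   def parse(self, response):
      # An Item is filled like a dict; only the declared fields are allowed,
      # so a typo in a field name raises a KeyError instead of failing silently
      item = First_scrapyItem()
      item["name"] = response.xpath("//title/text()").extract()
      item["url"] = [response.url]
      item["desc"] = response.xpath("//meta[@name='description']/@content").extract()
      yield item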

Scrapy – First Spider

Description

A spider is a class that defines the initial URLs to extract data from, how to follow pagination links, and how to extract and parse the fields defined in items.py. Scrapy provides different types of spiders, each of which serves a specific purpose.

Create a file called "first_spider.py" under the first_scrapy/spiders directory, where we can tell Scrapy how to find the exact data we're looking for. For this, you must define some attributes −

name − It defines the unique name for the spider.

allowed_domains − It contains the base URLs for the spider to crawl.

start_urls − A list of URLs from where the spider starts crawling.

parse() − It is a method that extracts and parses the scraped data.

The following code demonstrates how a spider code looks like −

import scrapy

class firstSpider(scrapy.Spider):
   name = "first"
   allowed_domains = ["dmoz.org"]
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]

   def parse(self, response):
      filename = response.url.split("/")[-2] + ".html"
      with open(filename, "wb") as f:
         f.write(response.body)

Scrapy – Following Links

Description

In this chapter, we'll study how to extract the links of the pages of our interest, follow them, and extract data from those pages. For this, we need to make the following changes to our previous code, shown as follows −

import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   ]

   def parse(self, response):
      for href in response.css("ul.directory.dir-col > li > a::attr(href)"):
         url = response.urljoin(href.extract())
         yield scrapy.Request(url, callback = self.parse_dir_contents)

   def parse_dir_contents(self, response):
      for sel in response.xpath("//ul/li"):
         item = DmozItem()
         item["title"] = sel.xpath("a/text()").extract()
         item["link"] = sel.xpath("a/@href").extract()
         item["desc"] = sel.xpath("text()").extract()
         yield item

The above code contains the following methods −

parse() − It extracts the links of our interest.

response.urljoin − The parse() method uses this to build a full URL and provide a new request, which is sent later to the callback.

parse_dir_contents() − This is a callback which actually scrapes the data of interest.

Here, Scrapy uses a callback mechanism to follow links. Using this mechanism, a bigger crawler can be designed to follow links of interest and scrape the desired data from different pages. The usual pattern is a callback method which extracts the items, looks for a link to the next page, and then yields a request for the same callback. The following example produces a loop which follows the links to the next page −

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()
      # ... extract article data here
      yield item

   next_page = response.css("ul.navigation > li.next-page > a::attr(href)")
   if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, self.parse_articles_follow_next_page)
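As a side note, recent Scrapy versions also provide response.follow(), which resolves the relative URL and builds the request in one step. The following is a hedged sketch of the same pagination loop using it; the selectors and ArticleItem are unchanged from the example above −

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()
      # ... extract article data here
      yield item

   # response.follow() joins the (possibly relative) URL against response.url
   # before yielding the request to the same callback
   next_page = response.css("ul.navigation > li.next-page > a::attr(href)").extract_first()
   if next_page:
      yield response.follow(next_page, callback = self.parse_articles_follow_next_page)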

Scrapy – Logging

Description

Logging means tracking events; it uses Python's built-in logging system, which defines functions and classes for use in applications and libraries. Logging works out of the box and can be configured with the Scrapy settings listed under Logging Settings below.

Scrapy sets some default settings and handles them with the help of scrapy.utils.log.configure_logging() when running commands.

Log Levels

In Python, there are five different levels of severity for a log message. The following list shows the standard log levels in ascending order −

logging.DEBUG − for debugging messages (lowest severity)

logging.INFO − for informational messages

logging.WARNING − for warning messages

logging.ERROR − for regular errors

logging.CRITICAL − for critical errors (highest severity)

How to Log Messages

The following code shows logging a message at the logging.INFO level −

import logging
logging.info("This is an information")

The above logging message can also be passed as an argument to logging.log, shown as follows −

import logging
logging.log(logging.INFO, "This is an information")

You can also use logger objects provided by the logging helpers to wrap the message, shown as follows −

import logging
logger = logging.getLogger()
logger.info("This is an information")

There can be multiple loggers, and they can be accessed by name with the logging.getLogger function, shown as follows −

import logging
logger = logging.getLogger("mycustomlogger")
logger.info("This is an information")

A customized logger can be created for any module by using the __name__ variable, which contains the module path, shown as follows −

import logging
logger = logging.getLogger(__name__)
logger.info("This is an information")

Logging from Spiders

Every spider instance has a logger within it, which can be used as follows −

import scrapy

class LogSpider(scrapy.Spider):
   name = "logspider"
   start_urls = ["http://dmoz.com"]

   def parse(self, response):
      self.logger.info("Parse function called on %s", response.url)

In the above code, the logger is created using the spider's name, but you can use any customized Python logger, as shown in the following code −

import logging
import scrapy

logger = logging.getLogger("customizedlogger")

class LogSpider(scrapy.Spider):
   name = "logspider"
   start_urls = ["http://dmoz.com"]

   def parse(self, response):
      logger.info("Parse function called on %s", response.url)

Logging Configuration

Loggers cannot display the messages sent to them on their own. They require "handlers" to display those messages; handlers redirect the messages to their respective destinations, such as files, e-mails, and standard output. Depending on the following settings, Scrapy configures the handler for the logger.

Logging Settings

The following settings are used to configure the logging −

LOG_ENABLED decides whether logging is enabled, and LOG_FILE sets the destination file for log messages.

LOG_ENCODING defines the encoding to be used for the log output.

LOG_LEVEL determines the minimum severity level of the messages; messages with lower severity are filtered out.

LOG_FORMAT and LOG_DATEFORMAT are used to specify the layout of all messages.

When you set LOG_STDOUT to true, all the standard output and error messages of your process are redirected to the log.
Command-line Options

Logging settings can be overridden by passing command-line arguments, as shown in the following list −

--logfile FILE − Overrides LOG_FILE

--loglevel/-L LEVEL − Overrides LOG_LEVEL

--nolog − Sets LOG_ENABLED to False

scrapy.utils.log Module

This function can be used to initialize logging defaults for Scrapy −

scrapy.utils.log.configure_logging(settings = None, install_root_handler = True)

Its parameters are −

settings (dict, None) − It creates and configures the handler for the root logger. By default, it is None.

install_root_handler (bool) − It specifies whether to install the root logging handler. By default, it is True.

The above function −

Routes warnings and Twisted logging through the Python standard logging.

Assigns the DEBUG level to the Scrapy logger and the ERROR level to the Twisted logger.

Routes stdout to the log, if the LOG_STDOUT setting is true.

Default options can be overridden using the settings argument. When settings are not specified, the defaults are used. A handler is created for the root logger when install_root_handler is set to True; if it is set to False, no log output is set up.

When using Scrapy commands, configure_logging is called automatically; it must be run explicitly when running custom scripts. To configure the logging output manually, you can use logging.basicConfig(), shown as follows −

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler = False)
logging.basicConfig(
   filename = "logging.txt",
   format = "%(levelname)s: %(message)s",
   level = logging.INFO
)
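As a complementary sketch (assuming a reasonably recent Scrapy version), configure_logging() can also be handed a plain dict of the logging settings described above instead of disabling the root handler; this keeps Scrapy's own handler but changes its format and level −

from scrapy.utils.log import configure_logging

# Pass logging-related settings directly; the names match the Logging Settings above
configure_logging(settings = {
   "LOG_FORMAT": "%(levelname)s: %(message)s",
   "LOG_DATEFORMAT": "%Y-%m-%d %H:%M:%S",
   "LOG_LEVEL": "INFO",
})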