Scrapy is a framework for building web crawlers and includes an API that can be used directly from a Python script. The framework includes many components and options that manage the details of requesting pages from websites and collecting and storing the desired data.
The typical way to run Scrapy is to use the framework to build a project in which we develop the code to do web scraping or crawling. In this article, I’ll begin with a small working example using the framework, illustrating the typical workflow. Then I’ll show you how to call the spider directly from a Python script.
This minimal spider scrapes http://quotes.toscrape.com, a site that exists specifically for practicing web scraping.
The Scrapy Framework
In the normal Scrapy workflow, we begin by starting a project with Scrapy’s startproject command.
(scrapy_new) saus@megux:~/scrapy_new/article$ cd projects
(scrapy_new) saus@megux:~/scrapy_new/article/projects$ scrapy startproject spiderdemo
New Scrapy project 'spiderdemo', using template directory '/home/saus/anaconda3/envs/scrapy_new/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /home/saus/scrapy_new/article/projects/spiderdemo

You can start your first spider with:
    cd spiderdemo
    scrapy genspider example example.com
This will create the following structure in a new directory with the same name as the project.
.
└── spiderdemo
    ├── scrapy.cfg
    └── spiderdemo
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            └── __init__.py
You would then use Scrapy’s genspider command to create a spider template to edit, as follows:
(scrapy_new) saus@megux:~/scrapy_new/article/projects/spiderdemo$ scrapy genspider funny quotes.scrape.com
Created spider 'funny' using template 'basic' in module:
  spiderdemo.spiders.funny
This creates the following spider code in the spiders directory.
import scrapy


class FunnySpider(scrapy.Spider):
    name = 'funny'
    allowed_domains = ['quotes.scrape.com']
    start_urls = ['http://quotes.scrape.com/']

    def parse(self, response):
        pass
This defines the class FunnySpider, which inherits from scrapy.Spider, the basic spider class provided by the Scrapy API, and sets a few important class attributes.
Now we edit the spider to create its behavior. Here’s the edited spider with an explanation of the changes.
import scrapy


class FunnySpider(scrapy.Spider):
    name = 'funny'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
The Modifications
- I’ve changed the first element of start_urls to ‘http://quotes.toscrape.com/tag/humor/‘. This restricts the spider to scraping only the quotes that have the tag ‘humor’, rather than all the quotes.
- I’ve filled in the parse method. This is where the work of examining the HTTP response from the web page gets done. A lot happens behind the scenes here. The following is an outline of the major actions.
- Scrapy makes an HTTP GET request to quotes.toscrape.com.
- It captures the response as a scrapy.http.response.html.HtmlResponse.
- It passes the response object to the default callback method, parse().
- parse() uses CSS and XPath selectors to locate the desired information and captures it for return (the sketch just after this list shows these selectors in isolation).
- parse() looks for a link to the next page (using a CSS selector). If it finds one, it calls response.follow(), which creates a request object.
- parse() returns control to Scrapy, which receives the scraped items and the new request, which is then queued for transmission by Scrapy’s scheduler.
- The process repeats until there is no longer a next page to fetch.
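If you want to experiment with these selectors before wiring them into a spider, you can exercise them in isolation. Here is a minimal sketch using the parsel package (the selector library Scrapy’s own selectors are built on) together with requests; the choice of these two packages and the variable names are mine, not part of the spider above.

import requests
from parsel import Selector

# Fetch the same page the spider starts from and wrap it in a selector.
html = requests.get('http://quotes.toscrape.com/tag/humor/').text
sel = Selector(text=html)

# The same CSS/XPath selectors that parse() uses for each quote.
for quote in sel.css('div.quote'):
    print(quote.xpath('span/small/text()').get())   # author
    print(quote.css('span.text::text').get())       # quote text

# The selector parse() uses to find the link to the next page.
print(sel.css('li.next a::attr("href")').get())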
Running the spider from the Scrapy framework
Now that the spider is ready, we can run it from the scrapy framework like this.
(scrapy_new) saus@megux:~/scrapy_new/article/projects/spiderdemo$ scrapy crawl funny --logfile=spiderlog
If we leave off ‘--logfile’, the log is printed to the terminal. After running the command, the file spiderlog will show all of Scrapy’s log messages (there are many, and they give you some notion of all the controls and settings that Scrapy has). To save the output as JSON, use the -o flag like this.
(scrapy_new) saus@megux:~/scrapy_new/article/projects/spiderdemo$ scrapy crawl funny -o out.json
If we look at the output file we see the following.
[ {"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}, {"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}, {"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Ch ristian must also think that sitting in a garage can make you a car.\u201d"}, {"author": "Jim Henson", "text": "\u201cBeauty is in the eye of the beholder and it may be necessar y from time to time to give a stupid or misinformed beholder a black eye.\u201d"}, {"author": "Charles M. Schulz", "text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d"}, {"author": "Suzanne Collins", "text": "\u201cRemember, we're madly in love, so it's all right to ki ss me anytime you feel like it.\u201d"}, {"author": "Charles Bukowski", "text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d"}, {"author": "Terry Pratchett", "text": "\u201cThe trouble with having an open mind, of course, is th at people will insist on coming along and trying to put things in it.\u201d"}, {"author": "Dr. Seuss", "text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d"}, {"author": "George Carlin", "text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d"}, {"author": "W.C. Fields", "text": "\u201cI am free of all prejudice. I hate everyone equally. \u201 d"}, {"author": "Jane Austen", "text": "\u201cA lady's imagination is very rapid; it jumps from admirati on to love, from love to matrimony in a moment.\u201d"} ]
So, the spider captures the quotes as desired.
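Since the output is ordinary JSON, consuming it from another Python script is straightforward. A minimal sketch, assuming the out.json file produced by the -o flag above:

import json

# Load the list of quote dictionaries the spider wrote.
with open('out.json') as f:
    quotes = json.load(f)

for q in quotes:
    print(f"{q['author']}: {q['text']}")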
How to run a spider directly from a Python script
This is probably the answer that brought you to this page. The following shows how to run the spider defined above directly from a Python script.
import scrapy
from scrapy.crawler import CrawlerProcess


# define spider
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        print(type(response))
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


process = CrawlerProcess(settings={
    "FEEDS": {
        "out.json": {"format": "json"},
    },
})

process.crawl(QuotesSpider)
process.start()
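Save this as an ordinary Python file (the name is up to you; say run_quotes.py) and run it with python run_quotes.py; when the crawl finishes, out.json will appear in the current directory.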
The spider class definition here is exactly the same as the one shown above. What’s different is that we import CrawlerProcess from scrapy.crawler and instantiate it, then use it to run our spider with the crawl method of the CrawlerProcess object. The output file is specified in the settings argument to CrawlerProcess.
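The settings dictionary accepts any Scrapy setting, not just FEEDS, so the same script can also control logging, rate-limiting, and so on. Below is a minimal sketch of a quieter, rate-limited variant; the specific values are illustrative, and the 'overwrite' feed option requires Scrapy 2.4 or later.

from scrapy.crawler import CrawlerProcess

# Assumes the QuotesSpider class defined above is in scope.
process = CrawlerProcess(settings={
    "FEEDS": {
        # replace the file on each run instead of appending (Scrapy >= 2.4)
        "out.json": {"format": "json", "overwrite": True},
    },
    "LOG_LEVEL": "WARNING",   # silence the verbose INFO-level log messages
    "DOWNLOAD_DELAY": 1.0,    # simple rate-limiting: wait a second between requests
})

process.crawl(QuotesSpider)
process.start()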
Conclusion
It should be understood that the framework provides support for different kinds of output, logging levels, rate-limiting, and so on. If you want to take advantage of these capabilities through the standard, well-documented configuration files, the framework is available if you build a Scrapy project. You will also find excellent documentation and a Scrapy tutorial at docs.scrapy.org.