We live in a world that relies on data, massive amounts of data. This data is used in many areas of business, for example:
- Marketing & sales
- Competition research
- Lead generation
- Content aggregation
- Monitoring consumer sentiment
- Data analytics and data science
- AI and machine learning
- Real Estate
- Product and price data
Much of this data is available on the internet for people to read and compare through sites that specialize in the type of data they’re interested in. But reading it by hand is inefficient and time-consuming, and the data is difficult to reuse in other programs. Web scraping extracts the data you need quickly and saves it in formats that other programs can use.
The purpose of this article is to get us up and running with Scrapy quickly. While Scrapy can handle both CSS and XPath selectors to get the data we want, we’ll be using CSS. The site we’re going to scrape is ‘Books to Scrape’, and the tools we’ll use are Python, the Web Developer Tools in Firefox, PyCharm, and the Python package Scrapy.
Installing Scrapy in PyCharm
Install Python, Firefox, and PyCharm using their default settings. Once these applications are installed, we need to create a project. To do this, open PyCharm and click on File → New Project…
I’ve named my project ‘scrapingProject’, but you can name it whatever you like. The project will take some time to create. Once it has been created, click on the Terminal tab and type in pip install scrapy.
Creating a Scrapy Project in PyCharm
After Scrapy is installed, we need to create a Scrapy project using scrapy startproject <projectName>. I’m naming mine scrapeBooks.
Creating the Scraping Spider
When the project has been created, change directories in the terminal to the project folder (cd <projectName>). Creating the project generates the additional files needed to run the spider, and the project folder is where we’ll be entering the other commands. Now, to create the spider, open the project folder, right-click on the spiders folder, select ‘New’ → ‘Python File’, and create a new Python file.
Open the new Python file and enter the following:

    # Import library
    import scrapy


    # Create Spider class
    class booksToScrape(scrapy.Spider):
        # Name of spider
        name = 'books'

        # Website you want to scrape
        start_urls = [
            'http://books.toscrape.com'
        ]

        # Parses the website
        def parse(self, response):
            pass
It should look like this:
We’re going to be scraping the title and price of each book from ‘Books to Scrape’, so let’s open Firefox and visit the site. Right-click on the title of a book and select ‘Inspect’ from the context menu.
Inspecting the Website to Be Scraped
Inspecting the site, we see that the title of the book sits in an <a> tag nested inside an <h3> tag. To make sure this will give us all the titles on the page, use the ‘Search’ box in the Inspector. We don’t have to use the whole path to get all the titles on the page; use a[title] in the search. The a identifies the tag, and [title] narrows the match to the links that carry a title attribute, which leaves out the links that only have an href. There will be 20 results on the page, and by pressing ‘Enter’ repeatedly you can cycle through all the book titles on this page.
To find out if this selector will work in Scrapy, we’re going to use the Scrapy shell. Go back to the PyCharm Terminal and enter scrapy shell to bring up the shell, which allows us to interact directly with the page. Retrieve the web page using fetch('http://books.toscrape.com').
Enter response.css('a[title]').get() at the prompt to see what we get.
Close, but we’re getting only one title, and not just the title text but the whole link element, catalogue link included. We need to tell Scrapy to grab just the title text of all the books on this page. To do this we’ll use ::text to get the title text and .getall() to get all the books. The new command is response.css('a[title]::text').getall().
Much better, we now have just the titles from the page. Let’s see if we can make it look better by using a for loop:

    for title in response.css('a[title]::text').getall():
        print(title)
That works, so now let’s add it to the spider. Copy the commands and place them inside the parse method, replacing pass.
Exiting the Scrapy Shell
Now to crawl the site. First, we must exit the Scrapy shell; to do that, use exit(). Next, use the name of the spider to crawl the site, like this: scrapy crawl books. You don’t use the file name to crawl the page because Scrapy looks up spiders by their name attribute, not the file name, and knows where to look.
Crawling 101
Now that we have titles, we need the prices. Using the same method as before, right-click on the price and inspect it.
The selector we want for the price of a book is .price_color. Using the previous commands, we just swap out 'a[title]' for '.price_color'. Running that in the Scrapy shell returns the prices.
Now that we have the selectors needed to grab just the titles and prices from the page, we need to find the common element holding them together. While looking at the earlier elements, you may have noticed that they’re grouped under .product_pod with other attributes. To separate these elements from the others, we’ll just tweak the code a bit:
    for i in response.css('.product_pod'):
        title = i.css('a[title]::text').getall()
        price = i.css('.price_color::text').getall()
        print(title, price)
As you can see, we’re selecting the element that the title and price are grouped under and then calling their separate selectors inside it. While print() shows the results on the terminal screen, printed output can’t be saved to an output file like .csv or .json. To save the results to a file you need to use the yield keyword:
    yield {
        'Title': title,
        'Price': price
    }
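To see why yield matters, here is a plain-Python sketch with no Scrapy involved. The two functions and the sample rows are hypothetical stand-ins for a parse method: one prints, the other yields. Scrapy iterates over whatever parse yields and can pass those items on to a file, whereas printed text never reaches it.

```python
def parse_with_print(rows):
    """Prints each row; the caller gets nothing back."""
    for title, price in rows:
        print(title, price)


def parse_with_yield(rows):
    """Yields one dict per row; the caller can collect and save them."""
    for title, price in rows:
        yield {'Title': title, 'Price': price}


# Made-up sample data standing in for scraped values.
rows = [('Book One', '£10.00'), ('Book Two', '£20.00')]

printed = parse_with_print(rows)      # output goes to the screen only
items = list(parse_with_yield(rows))  # items are available to the caller

print(printed)  # None
print(items)
```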
Now the spider is ready to crawl the site and grab just the titles and prices. It should look like this:
    # Import library
    import scrapy


    # Create Spider class
    class booksToScrape(scrapy.Spider):
        # Name of spider
        name = 'books'

        # Website you want to scrape
        start_urls = [
            'http://books.toscrape.com'
        ]

        # Parses the website
        def parse(self, response):
            # Book Information cell
            for i in response.css('.product_pod'):
                # Attributes
                title = i.css('a[title]::text').getall()
                price = i.css('.price_color::text').getall()

                # Output
                yield {
                    'Title': title,
                    'Price': price
                }
Let’s crawl the site and see what we get. I’ll be using scrapy crawl books -o Books.csv from the terminal.
We now have the data we were after and can use it in other programs. Granted, this isn’t much data, but it’s enough to demonstrate how the tool is used. You can use this spider to explore the other elements on the page.
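As one example of using the data elsewhere, the sketch below reads rows shaped like Books.csv with Python's csv module and converts the prices to numbers. The sample rows are made up, the load_books helper is hypothetical, and the code assumes every price starts with a £ sign, which you should confirm against your own file.

```python
import csv
import io

# Made-up rows in the same Title,Price shape as Books.csv.
SAMPLE_CSV = "Title,Price\nBook One,£10.00\nBook Two,£20.00\n"


def load_books(csv_file):
    """Reads Title/Price rows and returns dicts with numeric prices."""
    books = []
    for row in csv.DictReader(csv_file):
        # Assumption: the price text looks like '£10.00'
        price = float(row['Price'].lstrip('£'))
        books.append({'title': row['Title'], 'price': price})
    return books


books = load_books(io.StringIO(SAMPLE_CSV))
cheapest = min(books, key=lambda book: book['price'])
print(cheapest)  # {'title': 'Book One', 'price': 10.0}
```

To run it against the real output, open the file with open('Books.csv', newline='', encoding='utf-8') and pass the file object to load_books.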
Conclusion
Scrapy isn’t easy to learn, and many people are discouraged by it. I wanted to give those interested in it a quick way to start using it and see how it works. Scrapy is capable of so much more; I’ve just scratched the surface with what I wrote about it. To learn more, check the official documentation.