This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from MindBodyOnline.com or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot!
🕷️ Web scraping, a technique for extracting data from websites, has become an essential skill on Upwork and is one of the most sought-after skills on freelancing platforms in general. Most beginners start with the Beautiful Soup and Requests modules in Python. These tools are powerful, but they aren't sufficient for every site. The usual next step is Selenium, which can handle JavaScript-heavy pages but is often overkill or inefficient.
So, where should one start? The answer is simple: Always check for an API first.
Why Start with APIs?
An Application Programming Interface (API) allows two software applications to communicate with each other. Many websites offer APIs to provide structured access to their data, making it easier and more efficient than scraping the web pages directly.
Benefits of using APIs:
- Efficiency: Extracting data from APIs is often faster and less resource-intensive than scraping web pages.
- Reliability: APIs are designed to be accessed programmatically, reducing the chances of breaking changes.
- Ethical considerations: Accessing data via an API is often more in line with a website’s terms of service than scraping their pages directly.
MindBodyOnline provides a dedicated API tailored for developers: MindBody API.
If you’re aiming to craft an app utilizing their dataset, this API is your ideal resource. It boasts a plethora of endpoints, enabling swift data retrieval and ensuring seamless interaction between your application and their servers.
But what if you aren't building an application and just need to scrape data once for research? MindBodyOnline also populates its own website through an API: JavaScript on the page requests the data needed to render each listing. We can make requests to that same API ourselves.
How to Check if a Website Is Rendered with JavaScript
The site we will be scraping is MindBodyOnline.
If a website is rendered with JavaScript, we should check the network traffic and see if we can find a request that returns the data we see on the page. This can be done quickly with developer tools. In Chrome, you can bring up developer tools by pressing Ctrl-Shift-I.
From here, we can turn off JavaScript, then refresh the page and see if there are any changes. To turn off JavaScript, first hit Ctrl-Shift-P to bring up the command palette. Start typing "javascript" to filter the options, then click "Disable JavaScript".
Then refresh the page. With JavaScript disabled, the page content no longer loads; they clearly use JavaScript for all the data.
Before we can continue, we need to turn JavaScript back on. Bring up the command palette again, filter for "javascript", and click "Enable JavaScript". Then refresh the page again.
Check the JavaScript Requests
Select the Network tab in developer tools.
Make sure Fetch/XHR and Preserve log are selected. Next, click the circle with a line through it to clear the output. Then run a search on the site to see which requests are made.
We can then check each item in the output to see if it returns useful information.
We are primarily interested in the response to each request. We are looking for JSON data that matches the data shown on the page. In this case, it is the locations request that contains the data we seek.
We can also see that there is a payload required. When we make our requests, we must provide this payload in the request body. There are three items of interest here. The latitude and longitude allow us to control the city we are pulling data for, and we also need to provide a page number.
MindBody uses pagination, so a relatively small amount of data is pulled with each request. A large city like New York can have over a hundred pages.
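To make this concrete, here is roughly what that request body looks like, written as a Python dictionary. The field names are the ones that appear in the request payload (and in the spider's payload template later); the latitude/longitude values below are just placeholders for illustration.

```python
import json

# Approximate shape of the request body shown in the Payload tab.
# The lat/lon values are placeholders; page.number drives pagination.
payload = {
    "sort": "-_score,distance",
    "page": {"size": 50, "number": 1},
    "filter": {
        "categories": "any",
        "latitude": 40.7128,      # controls which city's gyms we get back
        "longitude": -74.0060,
        "categoryTypes": "any",
    },
}

body = json.dumps(payload)  # this JSON string becomes the request body
```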
We go to the headers tab to copy the request URL.
Using Insomnia to Generate Request Headers
From here, we can use a tool to help us with the request syntax.
🔗 Insomnia is a powerful open-source API client for testing and debugging APIs. It provides a user-friendly interface to send requests to web services and view responses. With Insomnia, you can define various request types, from simple HTTP GET requests to complex JSON, GraphQL, or even multipart file uploads. You can download the Insomnia desktop app here.
Using Insomnia is quite simple. Just paste in the API URL and click Send.
We can check the preview tab to make sure it returns the data we want:
This is where it gets good. If we click the dropdown on the Send button, one of the options is "Generate client code". How convenient! Select Python as the language and Requests as the library, click "Copy to Clipboard", and you're off to the races.
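For reference, the generated snippet looks roughly like the sketch below. This is a trimmed approximation, not Insomnia's exact output: the cookie and the various sec-* headers from your own session would also be included, and the payload is the same structure we saw in the network tab.

```python
import requests

url = "https://prod-mkt-gateway.mindbody.io/v1/search/locations"

# Request body copied from the browser's Payload tab (placeholder lat/lon).
payload = '{"sort":"-_score,distance","page":{"size":50,"number":1},"filter":{"categories":"any","latitude":40.7128,"longitude":-74.0060,"categoryTypes":"any"}}'

# A trimmed set of the headers the browser sent; values are session-specific.
headers = {
    "accept": "application/vnd.api+json",
    "content-type": "application/json",
    "origin": "https://www.mindbodyonline.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
}

# The site expects the body on a GET request; requests allows this via data=.
response = requests.request("GET", url, data=payload, headers=headers)
print(response.json())
```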
A Simple Scrapy Spider
The code can be found on GitHub. I will walk through the code below, starting with the imports.
```python
import scrapy
import json
import pandas as pd
from scrapy.crawler import CrawlerProcess
import os
```
Scrapy is a good option because it can handle multiple requests at the same time with asynchronous processing. Scrapy has a lot of bells and whistles and a fair bit of a learning curve, but it's also possible to avoid much of that extra complexity. The goal here was to place all the code in one simple script.
First, we have to create a spider class. The class is pretty large so I’ll display it in chunks.
```python
class MindbodySpider(scrapy.Spider):
    name = 'mindbody_spider'
    custom_settings = {
        'CONCURRENT_REQUESTS': 5,
        'DOWNLOAD_DELAY': 3.2,
    }
```
Our class inherits from one of the Scrapy Spider classes, with scrapy.Spider being the simplest. In the custom settings, with CONCURRENT_REQUESTS set to 5, Scrapy will process up to five requests at a time, starting a new one as soon as one finishes.
We use a DOWNLOAD_DELAY so we don't bombard the website with too many requests at once.
Next, we need a starting template for the payload
```python
    # <<pg>>, <<lat>>, and <<lon>> are placeholders filled in for each request
    starting_payload = '''{
        "sort":"-_score,distance",
        "page":{"size":50,"number":<<pg>>},
        "filter":{"categories":"any",
                  "latitude":<<lat>>,
                  "longitude":<<lon>>,
                  "categoryTypes":"any"}
    }'''
```
Next, we have the headers that Insomnia so helpfully provided for us.
```python
    headers = {
        "cookie": "__cf_bm=zdIhLHXKd2OAveBChKORUMdydUFVzC2Ma51sQxv.UJ0-1694646164-0-Abmbwcj2wNw%2FpityY4DWRWy%2FftBkjTO0vQ3tZ0gwU0P5bsTqcasf2XZlBwL%2BUaevGaH%2BTDzZOJPBXbWYwgsXkJc%3D",
        "authority": "prod-mkt-gateway.mindbody.io",
        "accept": "application/vnd.api+json",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "origin": "https://www.mindbodyonline.com",
        "sec-ch-ua": "^\^Not/A",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "^\^Windows^^",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "cross-site",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "x-mb-app-build": "2023-08-02T13:33:44.200Z",
        "x-mb-app-name": "mindbody.io",
        "x-mb-app-version": "e5d1fad6",
        "x-mb-user-session-id": "oeu1688920580338r0.2065068094427127"
    }
```
Then a very simple __init__ method:
```python
    def __init__(self):
        scrapy.Spider.__init__(self)
        self.city_count = 0
```
The start_requests method loops through each city. This is the main loop that creates the first request for each city.
```python
    def start_requests(self):
        cities = pd.read_csv('uscities.csv')
        for idx, city in cities.iterrows():
            lat, lon = city.lat, city.lng
            self.logger.info(f"{city.city}, {city.state_id} started")
            # Start with the first page for each city
            payload = (self.starting_payload
                       .replace('<<pg>>', '1')
                       .replace('<<lat>>', str(lat))
                       .replace('<<lon>>', str(lon)))
            yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': city.city, 'page_num': 1,
                      'lat': lat, 'lon': lon, 'state': city.state_id},
                callback=self.parse
            )
```
The code is pretty simple. We create a DataFrame from a CSV file with city information and then loop through it with the iterrows method. We create the payload for the request using the template and the lat/lon values from the DataFrame. The page is set to 1 each time; we will handle additional pages later.
Finally, we yield a scrapy.Request object. We use yield instead of return so we can handle multiple requests concurrently. The body is our modified payload, and we use the same headers for each request.
What do we do with the response returned from the request? As soon as the response comes back, it is fed into the parse method thanks to the callback parameter: callback=self.parse.
The meta parameter gives us a way to pass information to the callback function. We need the page_num, lat, and lon values for the next request; city_name and state are used for screen output.
The list of cities was downloaded off the web. Many different options will work, as long as they contain latitude and longitude values.
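As a quick sanity check, the spider only relies on four columns from that file: city, state_id, lat, and lng. A minimal stand-in that exercises the same access pattern might look like this (the two rows are just illustrative values):

```python
import pandas as pd

# Minimal stand-in for uscities.csv with only the columns the spider uses.
sample = pd.DataFrame({
    'city': ['New York', 'Los Angeles'],
    'state_id': ['NY', 'CA'],
    'lat': [40.7128, 34.0522],
    'lng': [-74.0060, -118.2437],
})
sample.to_csv('uscities.csv', index=False)

# Same access pattern as in start_requests
for idx, city in sample.iterrows():
    print(city.city, city.state_id, city.lat, city.lng)
```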
Parsing the Response
The parse method is a little long, but not too complicated.
Getting the data and saving it is very easy. We just convert response.text to a DataFrame and save it to a CSV file. If the file already exists, we append the data without a header; otherwise, we create a new CSV file and include a header.
```python
    def parse(self, response):
        data = json.loads(response.text)
        gyms_df = pd.json_normalize(data['data'])

        # Save the dataframe to a CSV named after the city and state
        city_name = response.meta['city_name']
        state = response.meta['state']
        fname = f'{city_name} {state}.csv'.replace(' ', '_')
        csv_path = f'./data/cities2/{fname}'

        # Check if the file exists to determine the write mode
        write_mode = 'a' if os.path.exists(csv_path) else 'w'
        gyms_df.to_csv(csv_path, mode=write_mode, index=False,
                       header=(not os.path.exists(csv_path)))
```
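One practical caveat: pandas will not create the ./data/cities2/ folder on its own, so it needs to exist before the spider runs, or to_csv will raise an error. A one-liner near the top of the script (exact placement is up to you) covers this:

```python
import os

# Make sure the output folder exists before any CSVs are written.
os.makedirs('./data/cities2', exist_ok=True)
```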
Handling Pagination
To move on to the next page, we need to create another Scrapy Request. For the payload we use the same latitude and longitude and increment the page number by 1.
```python
        # Check if there's another page and, if so, initiate the request
        next_page_num = response.meta['page_num'] + 1
        if next_page_num <= 150:  # Optional: upper limit on pages per city
            lat, lon = response.meta['lat'], response.meta['lon']
            payload = (self.starting_payload
                       .replace('<<pg>>', str(next_page_num))
                       .replace('<<lat>>', str(lat))
                       .replace('<<lon>>', str(lon)))
```
Make the Request for the Next Page
To finish the parse method, all we have to do is make another request with the new payload.
```python
            yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': response.meta['city_name'], 'page_num': next_page_num,
                      'lat': lat, 'lon': lon, 'state': state},
                callback=self.parse
            )

        self.city_count += 1
        print(response.meta['city_name'], f'complete ({self.city_count})')
        self.logger.info(f"""{response.meta['city_name']}, {response.meta['state']} is complete""")
```
How the Pagination Loop Terminates
What happens if there are 100 pages for the current city and the code sends a request with page_num = 101?
The request will not return anything, so the callback function won’t get called and the recursive loop for that city will stop.
Then the start_requests loop will move on to the next city.
It’s alive! Setting Our Little Spider Loose
To get our creepy critter crawling, we create a CrawlerProcess. Then tell it to crawl. Then tell it to start. On your mark, get set, CRAWL!
```python
process = CrawlerProcess()
process.crawl(MindbodySpider)
process.start()
```
Results
I was able to scrape data for 16,000 cities in about half a week. I think I averaged about 100 cities an hour. The larger cities had over a hundred pages but there were thousands upon thousands of cities with 5-10 pages.
What about the data? It’s fairly extensive and could be very useful.
There is pretty good information on services offered, location, amenities, total ratings, and so on, and the remaining columns add even more detail.
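Since every city ends up in its own CSV under ./data/cities2/, a natural first step for analysis is to stack them into a single DataFrame, for example:

```python
import glob
import pandas as pd

# Combine every per-city CSV the spider wrote into one DataFrame.
files = glob.glob('./data/cities2/*.csv')
all_gyms = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(f"{len(files)} city files, {len(all_gyms)} rows total")
print(all_gyms.columns.tolist())  # inspect which columns the API returned
```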
Conclusion
Uncovering the API proved invaluable. It eliminated the need to craft path selectors for individual data elements, significantly streamlining the process. Moreover, it spared me from devising a Scrapy workaround for the JavaScript-rendered page. Investing time in learning Scrapy was a sound decision, given its superior speed compared to other methods I explored.
Looking ahead, the logical progression is to integrate the data into platforms like Jupyter Notebook, Power BI, or Tableau. Furthermore, storing the data in a database seems apt, especially considering the apparent one-to-many relationships observed in each city, like categories and subcategories.
If you want to become a master web scraper, feel free to check out our academy course with downloadable PDF certificate to showcase your skills to future employers or freelancing clients:
🔗 Academy: Web Scraping with BeautifulSoup