Web Scraping With BeautifulSoup In Python

Summary: Web scraping is the process of extracting data from the internet. It is also known as web harvesting or web data extraction. Python allows us to perform web scraping using automated techniques. BeautifulSoup is a Python library used to parse data (structured data) from HTML and XML documents.

The internet holds an enormous wealth of data. Whether you are a data scientist, a business person, a student, or a professional, you have almost certainly scraped data from the internet. Yes, that’s right! If you have used the internet for work or even entertainment, you have already scraped data. So what does web scraping mean? It is the simple act of extracting data from a website. Even copying and pasting data from a web page is, in the broadest sense, web scraping. So if you have downloaded your favorite song or copied your favorite quote from the web, you have already scraped data from the internet.

In this article, we are going to explore some of the most frequently asked questions about web scraping. We shall then go through the entire process of creating a web scraper and see how we can automate the task of web scraping. So without further delay, let us begin our journey with web scraping.

What is Web Scraping?

Web scraping is the process of extracting data from the internet. It is also known as web harvesting or web data extraction. Python allows us to perform web scraping using automated techniques.

Some of the most commonly used libraries in Python for web scraping are:

  • The requests library.
  • The Beautiful Soup 4 library.
  • Selenium.
  • Scrapy.

In this article we are going to explore the BeautifulSoup library and the requests library to scrape data from the website.

Why Do We Scrape Data From The Internet?

Web scraping, if performed according to the proper guidelines, can be extremely useful and can make our lives easier by automating everyday tasks that we perform repeatedly over the internet.

  • If you are a data analyst and you need to extract data from the internet on a day-to-day basis, then creating an automated web crawler reduces the burden of extracting data manually every day.
  • You can use web scrapers to extract information about products from online shopping websites and compare product prices and specifications.
  • You can use web scraping for content marketing and social media promotions.
  • As a student or a researcher, you can use web scraping to extract data for your research/project from the web.

The bottom line is, “Automated web scraping allows you to work smart!”

Is Web Scraping Legal?

Now, this is a very important question, but unfortunately there is no single answer to it. Some websites don’t mind if you scrape content from their webpages, while others prohibit content scraping. Therefore, it is absolutely necessary that you follow the guidelines and do not violate the website’s policies while scraping content from its webpages.

Let us have a look at the few important guidelines that we must keep in mind while scraping content over the internet.

Remember:

Before we dive into web scraping, it is important that we understand how the web works and what HyperText Markup Language (HTML) is, because that is what we are going to extract our data from. Hence, let us briefly discuss the HTTP request/response model and HTML.

The HTTP Request/Response Model

The entire working principle of how the web works can be quite complicated but let us try and understand things at a simple level that would give us an idea of how we are going to approach web scraping.

In simple words, request/response is the communication model used by HTTP and by other protocols built on top of it. A client (such as a web browser) sends a request for a resource or a service to the server. If the server processes the request successfully, it sends back a response containing that resource; otherwise, it responds with an error message.

There are numerous HTTP methods used to interact with the web server, but the most commonly used ones are GET and POST.

  • GET : used to request data from a specific resource in the web server.
  • POST : used to send data to a server to create/update a resource.

Other HTTP methods are:

  • PUT
  • HEAD
  • DELETE
  • PATCH
  • OPTIONS
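
To make the request/response model concrete, here is a minimal sketch that runs entirely on your own machine. It uses only Python's standard library (http.server and urllib) rather than the requests library used later in this article, and the tiny server, its paths, and its responses are invented purely for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny hypothetical server: answers GET / and echoes any POST body."""

    def do_GET(self):
        if self.path == "/":
            body = b"<h1>Hello, scraper!</h1>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Unknown resource: respond with an error status instead
            self.send_error(404)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        self.send_response(201)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    # Ignore any system proxy settings for this local demo
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    results = {}
    try:
        # GET: request a resource from the server
        with opener.open(base + "/") as resp:
            results["get"] = (resp.status, resp.read().decode())
        # POST: send data to the server
        req = urllib.request.Request(base + "/echo", data=b"name=python", method="POST")
        with opener.open(req) as resp:
            results["post"] = (resp.status, resp.read().decode())
        # A request the server cannot fulfil produces an error response
        try:
            opener.open(base + "/missing")
        except urllib.error.HTTPError as err:
            results["error"] = err.code
    finally:
        server.shutdown()
        server.server_close()
    return results


results = run_demo()
print(results)
```

The client side of this sketch is what a scraper does: it sends a request and then inspects the status code and body of the response.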

Note: To scrape data from a website we will send a request to the web server using the requests library along with the get() method.

HTML – Hypertext Markup Language

HTML is a topic of discussion in itself and is beyond the scope of this article; however, you must be aware of its basic structure. Do not worry, you do not need to learn how to design a webpage using HTML and CSS, but you must be aware of some of the key elements/tags used while creating a webpage using HTML.

💡 HTML has a hierarchical / tree structure. This property enables us to access elements of the HTML document while scraping the webpage based on their parent and child relationship. In order to visualize the HTML tree structure let us have a look at the image given below.
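
The tree structure can also be printed as text. The following is a rough sketch that uses the standard library's html.parser module (not BeautifulSoup) to print each tag indented by its nesting depth. The sample markup is made up for illustration, and the simple depth counter assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="job">
      <h1>Python Developer</h1>
      <span class="info">Berlin</span>
    </div>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Collects one indented line per opening tag, mirroring the nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back to the parent's level


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
tree_lines = printer.lines
print("\n".join(tree_lines))
```

Here html is the root, head and body are its children, and h1 and span are children of the div, which is the kind of parent/child relationship a scraper relies on.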

I have listed a couple of links if you want to further explore and learn about how HTML works:

Creating The Web Scraper

Now let us begin creating our web scraper. The website that we are going to scrape is a job dashboard which lists the most recent Python jobs. In this walkthrough we shall be scraping:

  • The Job Title
  • The Location Of the Job
  • The Name Of the Organization

Website to be scraped: The Free Python Job Board

Step 1: Navigate and Inspect The Website/Webpage

The first and foremost task while scraping data from any webpage is to open up the webpage from which we are scraping the data and inspect the website using developer tools. You may also view the page source.

To navigate using developer tools:

  1. Right-click on the webpage.
  2. Select Inspect.

Note: Inspect Element is a developer tool built into most web browsers, including Google Chrome, Firefox, Safari, and Internet Explorer. It allows us to view and edit the HTML and CSS of the page as it is loaded in the browser. The changes made to the code are reflected in real time in your browser window. The best part is you don’t have to worry about breaking the page while you play around with the code, because your changes only take effect for the duration of your session and are only reflected on your screen. In other words, Inspect Element provides us a sort of ‘what if’ experience without affecting the content for any other user.

To view page source:

  1. Right-click on the webpage.
  2. Select View page source.

We first need to drill down into the HTML source code and identify the elements that we have to focus on while scraping the contents. The image given below denotes the sections that we need to work on while scraping.

Step 2: Create The User-Agent

A user agent is a client (typically a web browser) that sends requests to the web server on behalf of the user. It identifies itself to the server through the User-Agent request header. When a web server receives automated requests again and again from the same machine/system, it might guess that the requests are being sent by a bot and block them. Therefore, we can set the User-Agent header to mimic a browser visit to a particular webpage, which makes the server more likely to treat the request as coming from a regular user and not a bot.

Syntax:

import requests

# create User-Agent (optional)
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# pass the User-Agent through the headers parameter of the get() request
response = requests.get("http://pythonjobs.github.io/", headers=headers)
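
To see what the header actually changes, the sketch below prepares the same request twice without sending anything over the network: once with the defaults of the requests library and once with the browser-like User-Agent from above. It assumes the requests library is installed; the variable names are only illustrative.

```python
import requests

URL = "http://pythonjobs.github.io/"
CUSTOM_UA = ("Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/31.0.1650.0 Safari/537.36")

session = requests.Session()

# Prepare (but do not send) a request with the library's default headers
default_request = session.prepare_request(requests.Request("GET", URL))
default_ua = default_request.headers["User-Agent"]

# Prepare the same request with a browser-like User-Agent
custom_request = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": CUSTOM_UA})
)
custom_ua = custom_request.headers["User-Agent"]

print("Default:", default_ua)
print("Custom: ", custom_ua)
```

By default the library announces itself as python-requests followed by its version, which is an easy signal for a server that the request is automated.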

Step 3: Import The Requests Library

✨ The Requests Library

The requests library allows us to send a GET request to the web server.

Here’s how this works:

  • Import the Python library requests that handles the details of requesting the websites from the server in an easy-to-process format.
  • Use the requests.get(...) method to access the website and pass the URL 'http://pythonjobs.github.io/' as an argument so that the function knows which location to access.
  • Access the actual body of the response (the return value of get() is a Response object that also contains some useful meta information like the content type, etc.) and store it in a variable using the .content attribute.

Syntax:

import requests

# create User-Agent (optional)
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get("http://pythonjobs.github.io/", headers=headers)
# Store the webpage contents
webpage = response.content

✨ Checking The Status Code

Once the HTTP request is processed by the server, it sends a response that contains a status code. The status code indicates whether the request was successfully processed or not.

There are five main categories of status codes:

  • 1xx: Informational
  • 2xx: Success
  • 3xx: Redirection
  • 4xx: Client error
  • 5xx: Server error

Syntax:

print(response.status_code)
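
The category of a status code is determined by its first digit. As a small sketch, the helper below (describe_status is a made-up name, not part of requests) uses the standard library's http.HTTPStatus to turn a numeric code into a readable description.

```python
from http import HTTPStatus

# The first digit of a status code determines its category
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (Client error)'."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not one of the registered statuses known to Python
        phrase = "Unrecognized code"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a scraper you could pass response.status_code to such a helper, or simply continue only when the code is 200.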

Step 4: Parse HTML using BeautifulSoup Library

✨ The BeautifulSoup Library

BeautifulSoup is a Python library used to parse data (structured data) from HTML and XML documents.

  • Import the BeautifulSoup Library.
  • Create the BeautifulSoup Object. The first parameter represents the HTML data while the second parameter is the parser.

Syntax:

import requests
from bs4 import BeautifulSoup
# create User-Agent (optional)
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get("http://pythonjobs.github.io/", headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")

Once we have created the BeautifulSoup object, we need to use different options provided to us by the BeautifulSoup library to navigate and find elements within the HTML document and scrape data from it.
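
As a quick illustration of these options, the sketch below parses a small HTML string instead of a live page. The markup is invented for this example and only loosely resembles a job listing; find(), find_all(), select(), get_text(), and .parent are standard BeautifulSoup features.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
SAMPLE_HTML = """
<section class="job_list">
  <div class="job">
    <h1><a href="/jobs/1">Python Developer</a></h1>
    <span class="info">Berlin</span>
  </div>
  <div class="job">
    <h1><a href="/jobs/2">Data Engineer</a></h1>
    <span class="info">Remote</span>
  </div>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching tag
first_title = soup.find("h1").get_text(strip=True)

# find_all() returns every matching tag
all_titles = [h.get_text(strip=True) for h in soup.find_all("h1")]

# select() accepts a CSS selector
locations = [s.get_text(strip=True) for s in soup.select("div.job span.info")]

# Attributes are read like dictionary keys
first_link = soup.find("a")["href"]

# .parent walks one level up the tree
parent_class = soup.find("h1").parent["class"]

print(first_title, all_titles, locations, first_link, parent_class)
```

Note that class is a multi-valued attribute, so BeautifulSoup returns it as a list.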

💡 Attention:

In case you want to understand how to navigate through the HTML document using the components of the BeautifulSoup library, please refer to our tutorial to learn about the various options provided by BeautifulSoup to parse an HTML document.

Let us have a look at the code and then we will understand the working principle/logic behind it.

# The logic
# Outer loop: each <section class="job_list"> is a parent element holding the job postings
for job in soup.find_all('section', class_='job_list'):
    # All <h1> tags inside the section; these hold the job titles
    title = [a for a in job.find_all('h1')]
    # Inner loop: one <div class="job"> per posting; n is its position
    for n, tag in enumerate(job.find_all('div', class_='job')):
        # All <span class="info"> tags inside this posting
        company_element = [x for x in tag.find_all('span', class_='info')]
        print("Job Title: ", title[n].text.strip())
        print("Location: ", company_element[0].text.strip())
        print("Company: ", company_element[3].text.strip())
        print()

  • In the outer loop, i.e. for job in soup.find_all('section', class_='job_list'), we find the parent element, which in this case is the section tag with the HTML class job_list, and then iterate over the matches.
  • The title variable is a list comprehension used to store the job titles. In other words, job.find_all('h1') searches for all h1 tags inside the section, and the resulting tags are stored in the list title.
  • The inner loop, i.e. for n, tag in enumerate(job.find_all('div', class_='job')), has a couple of functionalities:
    1. Search for all div elements with the class job.
    2. Keep count of each iteration with the help of the enumerate function.
  • Inside the inner loop, the list comprehension company_element stores all span tags with the class info that are within the current div.
  • Finally, with the help of the counter n of the enumerate function, we extract the elements of the title list (which stores the job titles) by their index. The location and company names are extracted from index 0 and index 3 of the list company_element.
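
The logic above relies on the titles and the job divs lining up by index, and on every job having at least four info spans. As one possible variation (a sketch, not the only way to do it), the function below looks up the title inside each job div and returns None when a field is missing. The parse_jobs name and the sample markup, including the dates and company names, are invented for illustration and only assume the same tags and class names used in the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the tags and classes used in the loop above
SAMPLE_HTML = """
<section class="job_list">
  <div class="job">
    <h1><a href="/jobs/example-1.html">Backend Developer</a></h1>
    <span class="info">Vienna, Austria</span>
    <span class="info">Mon, 01 Jan 2024</span>
    <span class="info">permanent</span>
    <span class="info">Example GmbH</span>
  </div>
  <div class="job">
    <h1><a href="/jobs/example-2.html">Data Engineer</a></h1>
    <span class="info">Remote</span>
  </div>
</section>
"""


def parse_jobs(html):
    """Return a list of dicts with the title, location, and company of each job."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("section.job_list div.job"):
        # Look for the title inside this job's own div
        title_tag = card.find("h1")
        info = [span.get_text(strip=True)
                for span in card.find_all("span", class_="info")]
        jobs.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            # Guard the indexes so an incomplete posting does not raise IndexError
            "location": info[0] if len(info) > 0 else None,
            "company": info[3] if len(info) > 3 else None,
        })
    return jobs


jobs = parse_jobs(SAMPLE_HTML)
for job in jobs:
    print(job)
```

Returning a list of dictionaries instead of printing directly also makes it easier to save the results to a file later.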

The Final Solution

Now let us consolidate all the steps to reach the final solution/code as shown below:

import requests
from bs4 import BeautifulSoup
# create User-Agent (optional)
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get("http://pythonjobs.github.io/", headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
# The logic
for job in soup.find_all('section', class_='job_list'):
    title = [a for a in job.find_all('h1')]
    for n, tag in enumerate(job.find_all('div', class_='job')):
        company_element = [x for x in tag.find_all('span', class_='info')]
        print("Job Title: ", title[n].text.strip())
        print("Location: ", company_element[0].text.strip())
        print("Company: ", company_element[3].text.strip())
        print()

Output:

Job Title:  Software Engineer (Data Operations)
Location:  Sydney, Australia / Remote
Company:  Autumn Compass

Job Title:  Developer / Engineer
Location:  Maryland / DC Metro Area
Company:  National Institutes of Health contracting company.

Job Title:  Senior Backend Developer (Python/Django)
Location:  Vienna, Austria
Company:  Bambus.io

Hurrah! We have successfully created our first web scraper script.

Examples

As the saying goes – “Practice makes a man perfect!” Therefore, please have a look at the following article which lists the process of web scraping with the help of five examples. Click on the button/link given below to have a look at these examples and practice them to master the skill of web scraping using Python’s BeautifulSoup library.

Conclusion

I hope that after reading the entire article you can scrape data from webpages with ease! Please read the supporting articles in order to get a stronger grip on the mentioned concepts.
