Pagination in Webscraping

Rate this post

 Disclaimer: This tutorial considers that you have the basic knowledge of web scraping. The purpose of this article is to educate you on how to scrape content from websites with pagination. The examples and theories mentioned in this tutorial are solely for educational purposes and it is considered that you will not misuse them. In case of any missuse, it is solely your responsibility, and we are not responsible for it. If you are interested in learning the basic concepts of web scraping before diving into this tutorial, please follow the lectures at this link.

What is Pagination in a Website?

Pagination refers to the division of entire web content into numerous web pages and displaying the content page by page for proper visualization and also to provide a better user experience. Pagination can be handled either on the client end or the server end.

While building a web scraper, it can be extremely challenging to scrape content if the website has implemented pagination. In this tutorial, we will learn about the different types of pagination in websites and how to scrape content from them.

Pagination Types

Pagination can be implemented in numerous ways, but most websites implement one of these types of pagination:

  • Pagination with Next button.
  • Pagination without Next button.
  • Infinite Scroll 
  • The Load More Button

Pagination with Next Button

The following example demonstrates a website that has the next button. Once the next button is clicked, it loads the next page.

source: http://books.toscrape.com/catalogue/category/books/default_15/index.html

Approach: The following video demonstrates how to scrape the above website.

Code:

# 1. Import the necessary LIBRARIES
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# 2. Create a User Agent (Optional)
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 ("
                         "KHTML, like Gecko) Version/4.0 Safari/534.30"}

# 3. Define Base URL
url = 'http://books.toscrape.com/catalogue/category/books/default_15/index.html'

# 4. Iterate as long as pages exist
while True:
    # 5. Send get() Request and fetch the webpage contents
    response = requests.get(url, headers=headers)
    # 4. Check Status Code (Optional)
    # print(response.status_code)
    # 6. Create a Beautiful Soup Object
    soup = BeautifulSoup(response.content, "html.parser")
    # 7. Implement the Logic.
    # (extract the footer)
    footer = soup.select_one('li.current')
    print(footer.text.strip())
    # Find next page element if present.
    next_page = soup.select_one('li.next>a')
    if next_page:
        next_url = next_page.get('href')
        url = urljoin(url, next_url)
    # break out if no next page element is present
    else:
        break

Output:

Page 1 of 8
Page 2 of 8
Page 3 of 8
Page 4 of 8
Page 5 of 8
Page 6 of 8
Page 7 of 8
Page 8 of 8

Pagination Without Next Button

The following example demonstrates a website that has no next button. Instead, it uses page numbers to allow navigation. Once a particular page number is clicked, it loads the corresponding page.

source: https://www.gosc.pl/doc/791526.Zaloz-zbroje/

Approach: The following video demonstrates how to scrape the above website.

Code:

# 1. Import the necessary LIBRARIES
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# 2. Create a User Agent (Optional)
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 ("
                         "KHTML, like Gecko) Version/4.0 Safari/534.30"}
# 3. Define Base URL
url = 'https://www.gosc.pl/doc/791526.Zaloz-zbroje/'

# 5. Send get() Request and fetch the webpage contents
response = requests.get(url,headers=headers)
# 4. Check Status Code (Optional)
# print(response.status_code)
# 6. Create a Beautiful Soup Object
soup = BeautifulSoup(response.content, 'html.parser')
# 7. Implement the Logic.
img_src = [img['src'] for img in soup.select('.txt__rich-area img')]
print('https://www.gosc.pl/'+img_src[0])
page = soup.select('span.pgr_nrs a')
flag = 0
for i in range(len(page)):
    next_page = page[flag].text
    flag+=1
    url = urljoin(url, next_page) # iteration 1: https://www.gosc.pl/doc/791526.Zaloz-zbroje/2
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    img_src = [img['src'] for img in soup.select('.txt__rich-area img')]
    for i in img_src:
        if i.endswith('jpg'):
            print('https://www.gosc.pl/'+i)

Output:

https://www.gosc.pl//files/old/gosc.pl/elementy/gn23s18_kolumbA.jpg
https://www.gosc.pl//files/old/gosc.pl/elementy/gn23s18_kolumbB.jpg
https://www.gosc.pl//files/old/gosc.pl/elementy/gn23s18_kolumbC.jpg
https://www.gosc.pl//files/old/gosc.pl/elementy/gn23s18_kolumbD.jpg
https://www.gosc.pl//files/old/gosc.pl/elementy/gn23s18_kolumbE.jpg
https://www.gosc.pl//files/old/gosc.pl/elementy/gn23s18_kolumbF.jpg

Infinite Scroll

source: https://pharmeasy.in/health-care/personal-care-877

Approach: The following video demonstrates how to scrape the above website.

Code:

# 1. Import the necessary LIBRARIES
import requests
# 2. Create a User Agent (Optional)
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 ("
                         "KHTML, like Gecko) Version/4.0 Safari/534.30"}
# 3. Define Base URL
url = 'https://pharmeasy.in/api/otc/getCategoryProducts?categoryId=877&page='
page_number = 1
try:
    while True:
        # 4. Send get() Request and fetch the webpage contents
        response = requests.get(url + str(page_number), headers=headers)
        # 5. Extract the json data from the page
        data = response.json()
        # 6. The Logic
        name = []
        price = []
        if len(data['data']['products']) == 0:
            break
        else:
            for d in data['data']['products']:
                print(d['name'])
        page_number += 1
except:
    pass

Pagination with Load More Button

source: https://smarthistory.org/americas-before-1900/

Approach: Please follow the entire explanation in the following video lecture that explains how you can scrape data from websites that have implemented pagination with the help of the load more button.

Code:

# 1. Import the necessary LIBRARIES
import requests
# 2. Create a User Agent (Optional)
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 ("
                         "KHTML, like Gecko) Version/4.0 Safari/534.30"}
# 3. Define Base URL
url = 'https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page={}'
# 4. The Logic
pg_num = 1
title = []
while True:
    response = requests.get(url.format(pg_num), headers=headers)
    data = response.json()
    d = data['posts']
    for i in d:
        for key,value in i.items():
            if key == 'title':
                title.append(value.strip())
    if data.get('remaining') and int(data.get('remaining')) > 0:
        pg_num += 1
    else:
        break
# print extracted data
for i in title:
    print(i)

One of the most sought-after skills on Fiverr and Upwork is web scraping .

Make no mistake: extracting data programmatically from web sites is a critical life-skill in today’s world that’s shaped by the web and remote work.

This course teaches you the ins and outs of Python’s BeautifulSoup library for web scraping.