Disclaimer: This tutorial assumes that you have a basic knowledge of web scraping. The purpose of this article is to educate you on how to rotate proxies and avoid being blocked while web scraping. The examples and theories mentioned in this tutorial are solely for educational purposes, and it is assumed that you will not misuse them. Any misuse is solely your responsibility, and we are not liable for it. If you are interested in learning the basic concepts of web scraping before diving into this tutorial, please follow the lectures at this link.
An Introduction to Proxies
A major challenge faced by web scrapers is being blocked by web servers. Organizations have introduced technologies like CAPTCHAs to stop bot-like behaviour on their web servers. It is therefore extremely important, as a web scraper, not to scrape so recklessly that your crawler reveals bot-like behaviour and eventually gets blocked by the web server. There are numerous ways of achieving this, and one of the most efficient is to keep rotating your IP address and changing the user agent as frequently as possible when you scrape a certain website.
So, what are Proxies?
A proxy, or proxy server, is an intermediary server that sits between the client (your browser) and the destination server. Simply put, you can think of a proxy server as a gateway between your machine and the web server that you want to scrape. When you send a request to the web server through a proxy, the destination web server receives the request from a different IP, namely the IP of the proxy server, and has no idea about your IP address (exceptions always exist). Hence, a proxy server allows you to access another website while hiding your IP, thereby providing you with an added level of security/anonymity.
Let’s have a look at the benefits of rotating proxies (IP addresses) while web scraping:
- You can continue and retry scraping a webpage even after the initial IP address has been blocked.
- You get an added level of security as your IP, and thus your location is not revealed.
- Contents that are region-specific, i.e., geo-restricted, can be easily accessed with the help of proxies.
- Web servers find it very difficult to detect bot-like behaviour when IPs and user agents are rotated frequently. This is because the server receives thousands of requests coming from different IPs, so it assumes that the requests have been made by different users.
That is why rotating IP addresses is so important in web scraping. Now that we have an idea of the importance of proxies, let us learn how we can change our IP address.
How to Send Requests Through a Proxy in Python?
Approach: You can use the requests library to send a request to the web server through a proxy by passing the proxy within the proxies argument of the requests.get() method.
Example: In the following example, we will send a request to http://ip.jsontest.com/, which returns the IP address used to send the request. First, we will not use any proxy and note our original IP. Then we will use a proxy and check whether we managed to change/hide our original IP address with the help of the proxy.
Case 1: Sending Request Through Original IP
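A minimal sketch of this case, assuming only the requests library is installed:

```python
import requests

# Send the request without any proxy; the service echoes back
# the IP address it sees, which is our real public IP
response = requests.get('http://ip.jsontest.com/')
print(response.text)
```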
Case 2: Using Proxy to Send Request to Web Server
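And a sketch of the proxied request; the proxy address below is only a placeholder, so substitute a live proxy of your own:

```python
import requests

# Placeholder proxy address (replace with a real, working proxy as IP:port)
proxy = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}

# The destination server now sees the proxy's IP instead of ours
response = requests.get('http://ip.jsontest.com/', proxies=proxy)
print(response.text)
```

If the proxy is working, the printed IP will be the proxy's address rather than yours.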
Explanation: In the above example, we passed the proxy within the proxy dictionary and then used it within the requests.get() method by passing the dictionary to the proxies argument.
Finding a Free Proxy List
CAUTION: Using free proxies is not recommended, as most of them expire quickly or are of no use because they have already been blocked by servers. This is because they are globally available and may be used by millions of users. Hence, use a premium proxy list if possible.
If you wish to use a free proxy list, then automating the process is the best approach. Since free proxies expire soon, you should keep refreshing your proxy list. Creating the list manually can be extremely frustrating and tedious; hence, the best way to prune out the working free proxies is to use a script.
Example: The following example demonstrates how you can extract the working proxies from https://free-proxy-list.net/
Step 1: Open https://free-proxy-list.net/ and copy the raw list of proxies.
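If you would rather automate Steps 1 and 2 in one go, a minimal sketch like the following fetches the page and extracts everything that looks like an IP:port pair with a regular expression, then writes the result to free_proxy.txt. The regex-based extraction is an assumption on our part; it sidesteps the page's exact HTML layout, which may change over time:

```python
import re
import requests

# Download the page that lists the free proxies
html = requests.get('https://free-proxy-list.net/').text

# Pull out anything that looks like an IP:port pair; a loose regex
# is more robust than parsing the table, whose layout may change
proxies = re.findall(r'\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}', html)

# Save the result so the later snippets can read it from free_proxy.txt
with open('free_proxy.txt', 'w') as f:
    f.write('\n'.join(proxies))

print(f'Saved {len(proxies)} proxies to free_proxy.txt')
```

Either way, you end up with a free_proxy.txt file that the following steps can work with.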
Step 2: Store the list in a .txt file and extract each proxy one by one and store them in a list as shown in the following snippet:
```python
proxy_list = []
with open('free_proxy.txt') as f:
    for line in f:
        print(line)
        proxy_list.append(line.strip())
```
The above snippet will store all the extracted IPs from the file to a list.
Step 3: Check whether each proxy is functional or non-functional. An active proxy will return a status code of 200 in response to a GET request. Therefore, store the functional IP addresses in another list, as shown below.
```python
import requests

# storing IPs from file to list
proxy_list = []
with open('free_proxy.txt') as f:
    for line in f:
        proxy_list.append(line.strip())

# storing functional IPs in a list
working_proxies = []
for ip in proxy_list:
    proxy = {
        'http': 'http://' + ip,
        'https': 'http://' + ip,
    }
    try:
        # a proxy that does not answer within 10 seconds is treated as dead
        response = requests.get('http://example.org', proxies=proxy, timeout=10)
        if response.status_code == 200:
            working_proxies.append(ip)
    except requests.exceptions.RequestException:
        pass

print(working_proxies)
```
Note: This might take some time, but it is definitely far less time-consuming than checking each proxy manually, one by one.
Rotating Requests Using Proxy Pool
Once the pool of functional IPs is ready, we can use it to rotate the IPs from which we send requests to the web server. To select a random IP from the list, we can use the random.choice() method and then use the returned IP to send a GET request to the server.
Example: The following code illustrates the entire process. We first create a pool of functional IPs. Then we use random IPs from this pool to send numerous requests to the server. Every request is sent from a different IP address drawn from the pool we created, thereby gaining an added level of security and anonymity.
```python
import random
import requests

# storing IPs from file to list
proxy_list = []
with open('free_proxy.txt') as f:
    for line in f:
        proxy_list.append(line.strip())

# storing functional IPs in a list
working_proxies = []
for ip in proxy_list:
    proxy = {
        'http': 'http://' + ip,
        'https': 'http://' + ip,
    }
    try:
        response = requests.get('http://example.org', proxies=proxy, timeout=10)
        if response.status_code == 200:
            working_proxies.append(ip)
    except requests.exceptions.RequestException:
        pass

print(working_proxies)

# rotating IPs from working_proxies, considering we want to send 5 requests
for _ in range(5):
    random_ip = random.choice(working_proxies)  # pick a random working proxy
    proxy = {
        'http': 'http://' + random_ip,
        'https': 'http://' + random_ip,
    }
    res = requests.get('http://ip.jsontest.com/', proxies=proxy, timeout=10)
    print(f"Request received from following IP:\n{res.text}")
```
CAUTION
- You should not change or rotate IP addresses after logging in or while using a session.
- It is bad practice to use IP addresses that fall in the same sequence. Anti-scraping tools can easily detect that the requests are coming from a bot when thousands of them originate from the same IP range.
- Purchase and use premium proxies if you are scraping thousands of pages.
- Rotate IPs along with user agents to avoid detection, as shown in the sketch below.
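As a quick illustration of the last point, here is a minimal sketch that rotates both the proxy and the User-Agent header on every request. The user-agent strings below are examples only, and working_proxies is assumed to be the list built by the earlier snippet:

```python
import random
import requests

# Build this list with the proxy-checking snippet above;
# the entry here is only a placeholder
working_proxies = ['203.0.113.10:8080']

# Example User-Agent strings; in practice keep a larger, up-to-date pool
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

for _ in range(5):
    ip = random.choice(working_proxies)                   # rotate the IP
    headers = {'User-Agent': random.choice(user_agents)}  # rotate the user agent
    proxy = {'http': 'http://' + ip, 'https': 'http://' + ip}
    try:
        res = requests.get('http://ip.jsontest.com/',
                           proxies=proxy, headers=headers, timeout=10)
        print(res.text)
    except requests.exceptions.RequestException:
        pass  # dead proxy; skip it and try another
```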
Phew! That was all for this lecture on using proxies. Stay tuned for more information.
One of the most sought-after skills on Fiverr and Upwork is web scraping.
Make no mistake: extracting data programmatically from websites is a critical life skill in today's world that's shaped by the web and remote work.
This course on Finxter Academy teaches you the ins and outs of Pythonβs BeautifulSoup library for web scraping.