Summary: Download a file over the web in Python by using the following steps:
- Import the requests library
- Define the URL string
- Get the file data from the URL
- Store the file data in a file object on your computer
Here’s how these steps look for the book-cover example used throughout this article:
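A minimal sketch of the four steps above, using the book-cover URL that appears later in this article (the output filename book_cover.jpg is an arbitrary choice):

import requests

# 1. Define the URL string (a book cover from books.toscrape.com)
url = 'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

# 2. Get the file data from the URL
response = requests.get(url)

# 3. Store the file data in a file on your computer
with open('book_cover.jpg', 'wb') as f:
    f.write(response.content)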
When you first start web scraping, you may have trouble downloading files with Python. This article provides several methods you can use to download, for example, the cover of a book from a page.
As an example, we will use a page that does not prohibit scraping: http://books.toscrape.com/catalogue/category/books_1/index.html
How to Check What I’m Allowed to Scrape?
To check what exactly you are not allowed to scrape, add "robots.txt" to the end of the page’s URL. It should look like this: https://www.google.com/robots.txt. If the page does not specify what can be scraped, you should check its terms of service.
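If you prefer to check this from code, Python’s standard library includes urllib.robotparser; a minimal sketch (the URL being tested is just an example):

import urllib.robotparser

# Point the parser at the site's robots.txt and download it
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.google.com/robots.txt')
rp.read()

# Ask whether a generic crawler ("*") may fetch a given URL
print(rp.can_fetch('*', 'https://www.google.com/search'))  # False, since /search is disallowed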
Okay, end of the introduction, let’s get started!
How To Install Modules in Python?
Before you can use any method, you must first install the module (if you don’t have it) using:
pip install module_name
For example:
pip install requests
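The other third-party modules used later in this article can be installed the same way. Note that the package name on PyPI can differ from the name you import (bs4 is installed as beautifulsoup4):

pip install beautifulsoup4
pip install dload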
How to Get a Link to the File?
To get a link to the file, hover over the element you are looking for, right-click it, and choose “Inspect Element”. The source code of the page will pop up with the element that interests us already highlighted. Next, we copy the link to the file. Depending on what the link looks like (whether it is a full URL or a relative one that we have to complete before use), we paste it into the address bar to check that it points to the file we want. If it does, we use one of the methods below.
Method 1: requests Module
First we have to import the requests module and then create variables.
import requests

url_to_the_file = 'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
r = requests.get(url_to_the_file)
Once we have created the variables, we have to open a file in binary write mode and save it under a name whose extension matches the file we want to download (if we want to download a photo, the extension should be, for example, .jpg).
with open('A light in the attic - book cover.jpg', 'wb') as f:
    f.write(r.content)
Full code:
import requests

url_to_the_file = 'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
r = requests.get(url_to_the_file)

with open('A light in the attic - book cover.jpg', 'wb') as f:
    f.write(r.content)
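An optional refinement to the snippet above, not part of the original code: you can verify that the request actually succeeded before writing anything to disk, for example with raise_for_status():

import requests

url_to_the_file = 'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
r = requests.get(url_to_the_file)
r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx instead of saving an error page

with open('A light in the attic - book cover.jpg', 'wb') as f:
    f.write(r.content)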
After the code is executed, the image will appear in the current working directory. With this method we can easily download a single image, but what if we want to download several files at once? Let’s go to the next method to learn it!
Method 2: requests Module & BeautifulSoup Class from the bs4 Module
If you want to download several files from one page, this method is ideal. At the beginning we import the requests and bs4 modules (from bs4 we take the BeautifulSoup class) and create variables:
- url – the link to the page from which you want to download files,
- result – the HTML code of that page, fetched with requests,
- soup – a BeautifulSoup object (we use it to find elements),
- data – the data we are interested in, in this case all the <a>…</a> elements (these elements have an href attribute containing a link to something).
import requests
from bs4 import BeautifulSoup

url = 'https://telugump3audio.com/devi-1999-songs.html'
result = requests.get(url).content
soup = BeautifulSoup(result, 'html.parser')
data = soup.find_all('a')
Then we write a function that checks whether the links have the .mp3 extension and downloads the files that do:
def get_mp3_files(data_):
    links = []
    names_of_mp3_files = []
    for link in data_:
        # .get() avoids a KeyError for <a> tags without an href attribute
        if '.mp3' in link.get('href', ''):
            print(link['href'])
            links.append(link['href'])
            names_of_mp3_files.append(link.text)
    if len(names_of_mp3_files) == 0:
        raise Exception('No mp3 files found on the page')
    else:
        # Download each link and save it under the link's visible text
        for place in range(len(links)):
            with open(names_of_mp3_files[place], 'wb') as f:
                content = requests.get(links[place]).content
                f.write(content)
Full code:
import requests
from bs4 import BeautifulSoup


def get_mp3_files(data_):
    links = []
    names_of_mp3_files = []
    for link in data_:
        if '.mp3' in link.get('href', ''):
            print(link['href'])
            links.append(link['href'])
            names_of_mp3_files.append(link.text)
    if len(names_of_mp3_files) == 0:
        raise Exception('No mp3 files found on the page')
    else:
        for place in range(len(links)):
            with open(names_of_mp3_files[place], 'wb') as f:
                content = requests.get(links[place]).content
                f.write(content)


url = 'https://telugump3audio.com/devi-1999-songs.html'
result = requests.get(url).content
soup = BeautifulSoup(result, 'html.parser')
data = soup.find_all('a')
get_mp3_files(data)
Using this method, we can download even dozens of files!
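One caveat worth adding to the code above: if a page lists relative links (e.g. /files/song.mp3), requests.get() will fail on them. A sketch of how you might normalise such links with urllib.parse.urljoin before downloading (the fallback filename unnamed.mp3 is an arbitrary choice):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://telugump3audio.com/devi-1999-songs.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href', '')
    if '.mp3' in href:
        # urljoin leaves absolute URLs untouched and completes relative ones
        full_url = urljoin(url, href)
        with open(link.text.strip() or 'unnamed.mp3', 'wb') as f:
            f.write(requests.get(full_url).content)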
Method 3: urllib Module
The urllib module is provided by default in Python, so you do not need to install it before use.
First, we import urllib.request, because it contains the urlretrieve() function, which allows us to download images or music files. This function takes 4 arguments (1 obligatory and 3 optional), of which the first two are the most important:
- url – the link to the file you want to get,
- filename – the name under which you want to save the file.
import urllib.request

url = 'http://books.toscrape.com/media/cache/' \
      '2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
file_name = 'A light in the attic.jpg'

urllib.request.urlretrieve(url, file_name)
Note: according to the documentation, urllib.request.urlretrieve is a “legacy interface” and “might become deprecated in the future”.
However, there is another way to download the file using this module:
import urllib.request

url = 'http://books.toscrape.com/media/cache/' \
      '2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
file_name = 'A light in the attic.jpg'

response = urllib.request.urlopen(url)
content = response.read()

with open(file_name, 'wb') as f:
    f.write(content)
With this approach we also import urllib.request, but we use different functions: first urlopen() to open a connection to the URL, then read() to save the raw bytes of the file in a variable. Next we open a file with the name stored in the file_name variable and write those bytes to it in binary mode. This way we have the file we wanted!
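A small optional variation, not from the original snippet: for large files you may not want to hold the whole download in memory, and the standard library’s shutil.copyfileobj can stream the response straight to disk:

import shutil
import urllib.request

url = 'http://books.toscrape.com/media/cache/' \
      '2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
file_name = 'A light in the attic.jpg'

# Copy the response to the file in chunks instead of reading it all at once
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as f:
    shutil.copyfileobj(response, f)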
Method 4: dload Module
In Python version >= 3.6, you can also use the dload module to download a file. The save() function has 3 arguments (1 mandatory, 2 optional):
- url – the link to the file,
- path – the name under which you want to save your file; if you don’t specify a name, it will be derived from the ending of the link (in our case the file would be called 2cdad67c44b002e7ead0cc35693c0e8b.jpg, so it is better to specify your own filename),
- overwrite – if a file with the same name already exists in the working directory, it will be overwritten when this is True; when it is False, the file will not be downloaded (default = False).
import dload

url = 'http://books.toscrape.com/media/cache/' \
      '2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
filename = 'A light in the attic.jpg'

dload.save(url, filename)
Summary
You’ve learned how to check whether you have permission to download files from a page, and you’ve seen 4 methods of downloading them, using in turn: requests, requests with BeautifulSoup, urllib, and dload.
I hope this article will help you to download all the files you want.