The HTML iframe tag is widely used on webpages to display advertisements, map locations, results, etc. When we’re scraping a website, we might have to extract data from iframe elements.
It is possible to extract data from iframe elements, but the approach is slightly different: the embedded content lives in a separate document, so it has to be fetched and parsed separately.
In this article, let’s understand what iframe elements are and then discuss how to access the content within an iframe tag.
What is an iframe tag?
When we want to embed a document within a given HTML document, we use the iframe tag. The image below shows what embedding with an iframe tag looks like.
An iframe can contain another webpage, a CSV file, a text file, an image, etc.
Now, let’s take a look at the HTML code for the above webpage.
<!DOCTYPE html>
<html>
  <head>
    <title>HTML iframe Tag</title>
  </head>
  <body style="text-align: center">
    <h1>iframedemo</h1>
    <h2>HTML iframe Tag</h2>
    <iframe src="https://www.finxter.com/" height="400" width="400">
    </iframe>
  </body>
</html>
Note that the iframe tag has a src attribute, which holds the link (URL) to the document that is embedded within the iframe.
How to access the contents from an iframe tag?
BeautifulSoup only parses the HTML it is given. It doesn’t fetch the documents, images, or other kinds of objects that the HTML refers to.
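To make this concrete, here is a minimal sketch that parses a small page containing an iframe. The HTML string is a made-up example used only for illustration, and nothing is fetched over the network. It suggests that the parsed tree holds the iframe tag and its attributes, but not the embedded page.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = """
<html>
  <body>
    <h1>iframedemo</h1>
    <iframe src="https://www.finxter.com/" height="400" width="400"></iframe>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
iframe = soup.find("iframe")

# The tag and its attributes are available in the parsed tree...
print(iframe["src"])

# ...but the embedded page is not, so the iframe element has no
# child nodes and no links from the embedded page can be found.
print(len(iframe.contents))
print(soup.find("a"))
```

In other words, anything inside the embedded document has to come from a separate request to the URL in src.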
Now, let’s see how to access the contents from an iframe tag. Consider an HTML document, saved as demo.html, as shown below:
<!DOCTYPE html>
<html>
  <head>
    <title>HTML iframe Tag</title>
  </head>
  <body style="text-align: center">
    <h1>iframedemo</h1>
    <h2>HTML iframe Tag</h2>
    <iframe src="https://www.wikipedia.org/" height="400" width="400">
    </iframe>
    <iframe src="https://www.finxter.com/" height="400" width="400">
    </iframe>
  </body>
</html>
To access the iframe tags, let’s use the soup.find_all() method.
from bs4 import BeautifulSoup

# Parse the local HTML file
with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

# Collect every iframe tag in the document
iframes = soup.find_all('iframe')
print(iframes)
Output:
[<iframe height="400" src="https://www.wikipedia.org/" width="400"> </iframe>, <iframe height="400" src="https://www.finxter.com/" width="400"> </iframe>]
As we can see from the output, this gives the list of iframe tags. Now, let’s try accessing the src attribute of each iframe tag.
from bs4 import BeautifulSoup

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

iframes = soup.find_all('iframe')
for iframe in iframes:
    # Tag attributes are accessed like dictionary keys
    src = iframe['src']
    print(src)
Output:
https://www.wikipedia.org/
https://www.finxter.com/
That gives us the source URLs. Note that BeautifulSoup cannot automatically open the contents at a URL. We have to fetch each URL using the requests module, parse the returned page, and then access its HTML elements.
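Before fetching, it can help to check what each src actually contains. The sketch below assumes two cases that the demo page does not have: an iframe with a relative src, and an iframe with no src at all. The page URL and the markup are hypothetical, chosen only for illustration. It uses urljoin from the standard library to turn a relative src into an absolute URL that requests could fetch.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page URL and markup, chosen only for illustration.
page_url = "https://example.com/articles/demo.html"
html = """
<iframe src="/embed/map.html"></iframe>
<iframe src="https://www.wikipedia.org/"></iframe>
<iframe srcdoc="inline content"></iframe>
"""

soup = BeautifulSoup(html, "html.parser")

iframe_urls = []
for iframe in soup.find_all("iframe"):
    # .get() returns None when the attribute is missing,
    # whereas iframe["src"] would raise a KeyError.
    src = iframe.get("src")
    if not src:
        continue  # nothing to fetch for this iframe
    # Resolve relative paths against the URL of the containing page;
    # absolute URLs pass through unchanged.
    iframe_urls.append(urljoin(page_url, src))

print(iframe_urls)
```

For a local file such as demo.html there is no page URL to resolve against, so this step only matters when the outer page itself was downloaded from the web.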
Example – Let’s try fetching the URLs to the privacy policy from both the iframes.
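from bs4 import BeautifulSoup
import requests
import re

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

iframes = soup.find_all('iframe')
for iframe in iframes:
    src = iframe['src']
    # Fetch the embedded document; BeautifulSoup does not do this for us
    response = requests.get(src, timeout=10)
    if response.status_code == 200:
        soup_src = BeautifulSoup(response.text, 'html.parser')
        # Find the first link whose text contains "Privacy Policy"
        # (string= is the current name of the older text= argument)
        privacy_policy = soup_src.find('a', string=re.compile("Privacy Policy"))
        # find() returns None when no link matches
        if privacy_policy is not None:
            print(privacy_policy['href'])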
Output:
https://meta.wikimedia.org/wiki/Privacy_policy
https://blog.finxter.com/privacy-policy/
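The same steps can be wrapped in a small helper so they are easier to reuse and to try out without network access. This is only a sketch under stated assumptions: the function names find_link_in_iframes and fetch_with_requests, the fetch parameter, and the example.test pages are all hypothetical and not part of BeautifulSoup or requests.

```python
import re
from bs4 import BeautifulSoup


def find_link_in_iframes(html, link_text, fetch):
    """Map each iframe src in `html` to the href of the first matching link.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as a string, or None if the page could not be retrieved.
    """
    pattern = re.compile(link_text)
    results = {}
    for iframe in BeautifulSoup(html, "html.parser").find_all("iframe"):
        src = iframe.get("src")
        if not src:
            continue  # iframe without a src: nothing to fetch
        inner_html = fetch(src)
        if inner_html is None:
            results[src] = None  # page could not be retrieved
            continue
        inner_soup = BeautifulSoup(inner_html, "html.parser")
        link = inner_soup.find("a", string=pattern)
        results[src] = link.get("href") if link else None
    return results


def fetch_with_requests(url):
    """One possible `fetch` for real pages, using the requests module."""
    import requests  # imported here so the offline demo does not need it
    response = requests.get(url, timeout=10)
    return response.text if response.status_code == 200 else None


# Offline demo with made-up pages standing in for real websites.
fake_pages = {
    "https://a.example.test/": '<a href="/privacy">Privacy Policy</a>',
    "https://b.example.test/": "<p>No links here</p>",
}
outer_html = (
    '<iframe src="https://a.example.test/"></iframe>'
    '<iframe src="https://b.example.test/"></iframe>'
)
found = find_link_in_iframes(outer_html, "Privacy Policy", fake_pages.get)
print(found)
```

For real pages, fetch_with_requests can be passed as the fetch argument in place of fake_pages.get.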
Conclusion
In this short tutorial, we’ve seen what an iframe tag is and how to extract data from the document it embeds. We hope this article has been informative.
Thanks for reading.