How to Get the Contents from the iframe Tag using Beautiful Soup?

HTML iframe tags are widely used on webpages to display advertisements, maps, search results, and other embedded content. When we’re scraping a website, we might have to extract data from these iframe elements.

This is possible, but it takes an extra step: the embedded document is not part of the page’s own HTML, so it has to be fetched and parsed separately.

In this article, let’s understand what iframe elements are and then discuss how to access the content within the iframe tag.

What is an iframe tag?

When we want to embed one document within another HTML document, we use the iframe tag. The image below shows what embedding with an iframe tag looks like.

An iframe can contain another webpage, a CSV file, a text file, an image, etc.

Now, let’s take a look at the HTML code for the above webpage.

<!DOCTYPE html>
<html>

<head>
   <title>HTML iframe Tag</title>
</head>
<body style="text-align: center">
   <h1>iframedemo</h1>
   <h2>HTML iframe Tag</h2>
   <iframe src="https://www.finxter.com/"
           height="400"
           width="400">
   </iframe>
</body>

</html>

Note that the iframe tag has a src attribute, which holds the link (URL) to the document to be embedded within the iframe.

How to access the contents from an iframe tag?

Beautiful Soup only parses the HTML it is given. It does not download the document an iframe points to, nor images or other embedded resources.
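To make this concrete, here is a minimal sketch that parses a small page held in a string. The inline markup is a hypothetical stand-in for a real page, and the variable names are ours. It shows that the iframe tag and its attributes are visible to the parser, while the embedded page is not.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for a real page.
html = """
<html><body>
  <h1>iframedemo</h1>
  <iframe src="https://www.finxter.com/" height="400" width="400"></iframe>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
iframe = soup.find("iframe")

# The tag and its attributes are available to the parser...
iframe_src = iframe["src"]
# ...but the embedded page is not: the tag has no text of its own.
iframe_text = iframe.get_text(strip=True)

print(iframe_src)         # https://www.finxter.com/
print(repr(iframe_text))  # ''
```

The empty string is the point here: whatever the browser renders inside the frame lives at the src URL, not in this document.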

Now, let’s see how to access the contents from an iframe tag. Consider the HTML document below, saved as demo.html:

<!DOCTYPE html>
<html>

<head>
   <title>HTML iframe Tag</title>
</head>
<body style="text-align: center">
   <h1>iframedemo</h1>
   <h2>HTML iframe Tag</h2>
   <iframe src="https://www.wikipedia.org/"
           height="400"
           width="400">
   </iframe>
   <iframe src="https://www.finxter.com/"
           height="400"
           width="400">
   </iframe>
</body>

</html>

To access the iframe tags, let’s use the soup.find_all() method.

from bs4 import BeautifulSoup

# Parse the local file and collect every iframe tag in it.
with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

iframes = soup.find_all('iframe')
print(iframes)

Output:

[<iframe height="400" src="https://www.wikipedia.org/" width="400">
</iframe>, <iframe height="400" src="https://www.finxter.com/" width="400">
</iframe>]

As we can see from the output, this gives the list of iframe tags. Now, let’s try accessing the src attribute from the iframe tag.

from bs4 import BeautifulSoup

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the src attribute of every iframe tag.
for iframe in soup.find_all('iframe'):
    src = iframe['src']
    print(src)

Output:

https://www.wikipedia.org/
https://www.finxter.com/
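The loop above assumes every iframe has an absolute src. On real pages the src can be relative, or missing altogether (for example, when the iframe uses srcdoc instead). As a rough sketch, a helper like the one below could handle both cases. The function name iframe_urls, the sample markup, and the example.com base URL are all made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, base_url):
    """Return an absolute URL for every iframe in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for iframe in soup.find_all("iframe"):
        src = iframe.get("src")  # None when the attribute is absent
        if src:
            # urljoin leaves absolute URLs alone and resolves relative ones.
            urls.append(urljoin(base_url, src))
    return urls


# Hypothetical markup: one absolute src, one relative src, one with no src.
sample = """
<iframe src="https://www.wikipedia.org/"></iframe>
<iframe src="/embedded/page.html"></iframe>
<iframe></iframe>
"""

print(iframe_urls(sample, "https://example.com/demo.html"))
# ['https://www.wikipedia.org/', 'https://example.com/embedded/page.html']
```

Using iframe.get("src") instead of iframe["src"] avoids a KeyError on tags that have no src attribute.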

That gives us the source URLs. Note that Beautiful Soup does not open these URLs for us. We have to request each URL with the requests module, parse the response as a separate document, and only then look for HTML elements inside it.

Example: Let’s fetch the privacy policy URL from the page embedded in each iframe.

from bs4 import BeautifulSoup
import requests
import re

with open("demo.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

for iframe in soup.find_all('iframe'):
    # Download the embedded page and parse it as a separate document.
    response = requests.get(iframe['src'], timeout=10)
    if response.status_code == 200:
        soup_src = BeautifulSoup(response.text, 'html.parser')
        # Match the link text case-insensitively; find() returns None on no match.
        privacy_policy = soup_src.find('a', string=re.compile("privacy policy", re.IGNORECASE))
        if privacy_policy is not None:
            print(privacy_policy['href'])

Output:

https://meta.wikimedia.org/wiki/Privacy_policy
https://blog.finxter.com/privacy-policy/
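The link lookup can also be tried without a network connection. The sketch below runs it against a hypothetical footer snippet that stands in for a downloaded iframe page. It assumes a reasonably recent Beautiful Soup, where find() accepts the string argument, and it guards against the case where no link matches.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical footer markup standing in for a downloaded iframe page.
page = """
<footer>
  <a href="/about/">About</a>
  <a href="/privacy-policy/">Privacy Policy</a>
  <a href="/terms/">Terms of Use</a>
</footer>
"""

soup = BeautifulSoup(page, "html.parser")
pattern = re.compile(r"privacy\s+policy", re.IGNORECASE)

# string= matches against the text of each <a> tag. find() returns None
# when nothing matches, so check before reading the href.
link = soup.find("a", string=pattern)
href = link["href"] if link is not None else None

print(href)  # /privacy-policy/
```

The case-insensitive pattern is a deliberate choice: sites capitalize this link differently, and a strict match would silently miss some of them.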

Conclusion

In this short tutorial, we’ve seen what an iframe tag is and how to extract data from the document it embeds. We hope this article has been informative. Do you want to improve your Python skills? Subscribe to our email academy.

Thanks for reading.