How to Remove Extra Whitespace in BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML and XML documents.

Because the library is easy to install and use, developers rely on it heavily for web scraping. If the web page is in HTML format, we can parse it with an HTML parser. After parsing the document, we can filter for only the tags we need and fetch their data.

However, it is important to note that any whitespace in the HTML document is preserved as is when the text is printed. Consider the following example, a list of comments on a user’s post on a social media platform.

<div>
<li><span class="Mr508">
                    This post is so informative!
                </span></li>
<li><span class="Mr508">
                   Informative
               </span></li>
<li><span class="Mr508">
                   Thanks for posting
                </span></li>
</div>

Fetching Text Values Without Spaces

If you look closely, the markup contains a lot of extra whitespace. When you fetch the text, that whitespace comes along with it, as the following code snippet shows:

from bs4 import BeautifulSoup
import re
html=""" 
<div>
<li><span class="Mr508">
                    This post is so informative!
                </span></li>
<li><span class="Mr508">
                   Informative
               </span></li>
<li><span class="Mr508">
                   Thanks for posting
                </span></li>
</div>
"""
soup=BeautifulSoup(html,'html.parser')
output=soup.find_all('div')
for ele in output:
    print(ele.text)

Output:

                    This post is so informative!
                 

                    Informative
                

                    Thanks for posting

Now, how do we remove the extra spaces from the value? 

In today’s article, let’s discuss different ways of removing extra whitespace from the HTML document.

Method 1: Using str.strip()

The simplest way to remove the extra whitespace from a text value is to call str.strip() on it:

soup=BeautifulSoup(html,'html.parser')
output=soup.find_all('li')
for ele in output:
    print(ele.text.strip())

Output:

This post is so informative!
Informative
Thanks for posting
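
Beautiful Soup's get_text() method also accepts a strip argument, which can stand in for the explicit str.strip() call. The following is a minimal sketch that reuses a shortened copy of the comment markup; the variable name comments is only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div>
<li><span class="Mr508">
                    This post is so informative!
                </span></li>
<li><span class="Mr508">
                   Informative
               </span></li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) strips every text fragment inside the tag
# and drops fragments that are empty after stripping
comments = [li.get_text(strip=True) for li in soup.find_all("li")]
print(comments)
```

When a tag contains several text fragments, get_text(strip=True) joins them with no separator. Passing a separator as the first argument, for example get_text(" ", strip=True), keeps the fragments apart.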

Method 2: Using stripped_strings

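Beautiful Soup provides a generator called stripped_strings. When accessed on a soup element, it yields each text string inside that element with the leading and trailing whitespace removed, and it skips strings that consist only of whitespace.

Refer to the example below for more details.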

soup=BeautifulSoup(html,'html.parser')
output=soup.find('div')
for ele in output.stripped_strings:
   print(ele)

Output:

This post is so informative!
Informative
Thanks for posting

However, note that stripped_strings is available only on a single element (a Tag or BeautifulSoup object). If we were to use find_all('li') in the above example, it would return a ResultSet, which is a list of elements. Accessing stripped_strings on the ResultSet results in an error, as shown below.

soup=BeautifulSoup(html,'html.parser')
output=soup.find_all('li')
for ele in output.stripped_strings:
   print(ele)

Output:

Traceback (most recent call last):
  File "C:\Users\paian\PycharmProjects\Finxter\venv\Solutions\How to remove white spaces using beautiful soup.py", line 18, in <module>
    for ele in output.stripped_strings:
  File "C:\Users\paian\PycharmProjects\Finxter\venv\lib\site-packages\bs4\element.py", line 2253, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'stripped_strings'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
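
One way around this error is to loop over the ResultSet and read stripped_strings from each element in turn. The sketch below assumes the same comment markup (shortened here) and joins the fragments of each element with a single space; the list name cleaned is illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div>
<li><span class="Mr508">
                    This post is so informative!
                </span></li>
<li><span class="Mr508">
                   Thanks for posting
                </span></li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

cleaned = []
# find_all() returns a ResultSet; each item in it is a Tag
for li in soup.find_all("li"):
    # stripped_strings is defined on the Tag, not on the ResultSet
    cleaned.append(" ".join(li.stripped_strings))

print(cleaned)
```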

Fetching Both Tags and Values Without Spaces 

At times, we might want to fetch a portion of the HTML document as is, tags included, but without the extra whitespace.

That is, from the above example, we might need all the elements inside the div tag, but without the unnecessary whitespace, as shown below.

<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>

We can use one of the following methods to achieve this.

Method 1: Using str.strip()

We can convert the element to a string, split it into lines, call str.strip() on each line, and join the results, as shown below.

soup=BeautifulSoup(html,'html.parser')
output=soup.find('div')

# Method 1 - Using strings
html_string=[]
# strip the leading and trailing whitespace from every line
for ele in str(output).split("\n"):
   html_string.append(ele.strip())
# merge the list into a single string
print("".join(html_string))

Output:

<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>

Alternatively, we can use a list comprehension to achieve the same result.

soup=BeautifulSoup(html,'html.parser')
output=soup.find('div')

# Method 1 - Using strings
print("".join([ele.strip() for ele in str(output).split("\n")]))

Output:

<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>
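
Stripping every line and joining with an empty string works here because no comment wraps onto a second line. The sketch below uses a hypothetical helper, strip_lines, and hypothetical markup in which a sentence does wrap, to show how the words on either side of the line break get fused, and how joining with a space avoids that at the cost of a space next to the tags.

```python
def strip_lines(markup: str, sep: str = "") -> str:
    """Strip each line of the markup, drop empty lines, and join with sep."""
    stripped = (line.strip() for line in markup.splitlines())
    return sep.join(line for line in stripped if line)

# Hypothetical markup: the sentence wraps onto a second line
wrapped = "<li>\n    Thanks for\n    posting\n</li>"

fused = strip_lines(wrapped)        # words across the line break are merged
spaced = strip_lines(wrapped, " ")  # words stay apart, tags gain a space

print(fused)
print(spaced)
```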

Method 2: Using regular expressions

We can also remove the whitespace in the HTML using regular expressions.

In the expression below,

  • [\n] matches every newline character in the string.
  • [\ ]{2,} matches a run of two or more spaces in the string.

We can replace these matches with an empty string, thus removing the extra whitespace from the document.

soup=BeautifulSoup(html,'html.parser')
output=soup.find('div')
# raw string, so that "\ " is not treated as an invalid string escape
pattern=re.compile(r"([\n])|([\ ]{2,})")
print(re.sub(pattern,'',str(output)))

Output:

<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>
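
Note that the pattern above also deletes runs of two or more spaces inside the text itself, which can fuse neighbouring words. A more conservative sketch, assuming the whitespace to remove always sits directly next to a tag, trims only what follows a ">" or precedes a "<". The markup below is hypothetical and has three spaces between two words to show the difference.

```python
import re

# Hypothetical markup with a run of spaces inside the comment text
markup = '<li><span class="Mr508">\n        Thanks   for posting\n    </span></li>'

# The pattern from the article: removes newlines and any run of 2+ spaces
aggressive = re.sub(r"([\n])|([\ ]{2,})", "", markup)

# Conservative variant: only remove whitespace adjacent to a tag boundary
trimmed = re.sub(r">\s+", ">", markup)
trimmed = re.sub(r"\s+<", "<", trimmed)

print(aggressive)
print(trimmed)
```

This variant is still a plain text substitution, so it does not know about elements such as pre where whitespace is significant, or about whitespace between inline elements that affects rendering.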

Conclusion

That brings us to the end of this article.

In this article, we have learned different ways of removing extra whitespace from HTML when parsing it with the BeautifulSoup library.

We hope this article has been informative.