BeautifulSoup is a library used for parsing web pages.
Because the library is easy to install and use, developers rely on it extensively for web scraping. If the webpage is in HTML format, we can parse it with an HTML parser. After parsing the document, we can filter only the required tags and fetch the data.
However, it is important to note that any whitespace in the HTML document is preserved and printed as is. Consider the following example. This is a list of comments on a user's posts on a social media platform.
<div>
    <li><span class="Mr508">
        This post is so informative!
    </span></li>
    <li><span class="Mr508">
        Informative
    </span></li>
    <li><span class="Mr508">
        Thanks for posting
    </span></li>
</div>
Fetching Text Values Without Spaces
If you look closely, the markup contains a lot of extra whitespace. When you fetch the text, that whitespace comes along with it. Refer to the code snippet below for details:
from bs4 import BeautifulSoup
import re

html = """
<div>
    <li><span class="Mr508">
        This post is so informative!
    </span></li>
    <li><span class="Mr508">
        Informative
    </span></li>
    <li><span class="Mr508">
        Thanks for posting
    </span></li>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
output = soup.find_all('div')
for ele in output:
    print(ele.text)
Output:
This post is so informative! Informative Thanks for posting
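One way to make the stray whitespace visible is to print the repr() of each text value instead of the text itself. The sketch below assumes that each comment sits on its own indented line inside its span; that layout and the variable names are illustrative and not taken from a real page.

```python
from bs4 import BeautifulSoup

# Assumed layout: each comment sits on its own indented line inside the span.
html = """
<div>
    <li><span class="Mr508">
        This post is so informative!
    </span></li>
    <li><span class="Mr508">
        Informative
    </span></li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# repr() shows the newlines and spaces that print() would render invisibly.
raw_values = [li.text for li in soup.find_all("li")]
for value in raw_values:
    print(repr(value))
```

Each printed value begins and ends with newline and space characters, which is exactly the padding the methods in this article remove.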
Now, how do we remove the extra spaces from the value?
In today's article, let's discuss different ways of removing extra whitespace from the HTML document.
Method 1: Using str.strip()
The simplest way to remove the extra spaces is to call str.strip() on the text of each element:
soup = BeautifulSoup(html, 'html.parser')
output = soup.find_all('li')
for ele in output:
    print(ele.text.strip())
Output:
This post is so informative!
Informative
Thanks for posting
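As an alternative to calling str.strip() yourself, Beautiful Soup's get_text() accepts a strip=True argument that strips each piece of text before joining the pieces. A minimal sketch, assuming comment markup like the example above (the html string and variable names here are illustrative):

```python
from bs4 import BeautifulSoup

# Assumed layout: each comment sits on its own indented line inside the span.
html = """
<div>
    <li><span class="Mr508">
        This post is so informative!
    </span></li>
    <li><span class="Mr508">
        Informative
    </span></li>
    <li><span class="Mr508">
        Thanks for posting
    </span></li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True strips every text fragment and drops the ones that end up empty.
comments = [li.get_text(strip=True) for li in soup.find_all("li")]
print(comments)

# On a parent tag, a separator keeps the stripped fragments apart.
joined = soup.find("div").get_text(" | ", strip=True)
print(joined)
```

Without a separator, get_text(strip=True) on a parent tag would run the comments together, so passing one is usually the safer choice when a tag holds several strings.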
Method 2: Using stripped_strings
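Beautiful Soup provides a generator called stripped_strings. When accessed on a tag or on the soup object, it yields each string inside that element with leading and trailing whitespace removed, and it skips strings that consist only of whitespace.
Refer to the example below for more details.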
soup = BeautifulSoup(html, 'html.parser')
output = soup.find('div')
for ele in output.stripped_strings:
    print(ele)
Output:
This post is so informative!
Informative
Thanks for posting
However, note that stripped_strings is available only on a single element, such as a Tag or the soup object itself. If we were to use find_all('li') in the above example, it would return a ResultSet, which is a list of elements. Accessing stripped_strings on that list results in an error, as shown below.
soup = BeautifulSoup(html, 'html.parser')
output = soup.find_all('li')
for ele in output.stripped_strings:
    print(ele)
Output:
Traceback (most recent call last):
  File "C:\Users\paian\PycharmProjects\Finxter\venv\Solutions\How to remove white spaces using beautiful soup.py", line 18, in <module>
    for ele in output.stripped_strings:
  File "C:\Users\paian\PycharmProjects\Finxter\venv\lib\site-packages\bs4\element.py", line 2253, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'stripped_strings'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
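If find_all() is what you need, one way around this error is to loop over the ResultSet and read stripped_strings from each tag in turn. The sketch below assumes markup similar to the example; the nested b tag in the last comment is a made-up addition to show how several strings inside one tag can be joined.

```python
from bs4 import BeautifulSoup

# Assumed markup; the <b> tag in the last comment is hypothetical.
html = """
<div>
    <li><span class="Mr508">
        This post is so informative!
    </span></li>
    <li><span class="Mr508">
        Thanks <b> a lot </b>
    </span></li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list of Tag objects), so iterate over it
# and use stripped_strings on each individual Tag.
cleaned = []
for li in soup.find_all("li"):
    cleaned.append(" ".join(li.stripped_strings))

print(cleaned)
```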
Fetching Both Tags and Values Without Spaces
At times, we might want to fetch a portion of the HTML document as is, tags included, but without the extra spaces.
That is, from the above example, we might need all the elements inside the div tag, but without the unnecessary whitespace, as shown below.
<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>
We can achieve this in one of the following ways.
Method 1: Using str.strip()
We can split the HTML into lines and use the str.strip() method on each line to get rid of the extra spaces, as shown below.
soup = BeautifulSoup(html, 'html.parser')
output = soup.find('div')

# Method 1 - Using strings
html_string = []
for ele in str(output).split("\n"):
    html_string.append(ele.strip())

# merge the list to a string
print("".join(html_string))
Output:
<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>
Alternatively, we can use a list comprehension to achieve the same result.
soup = BeautifulSoup(html, 'html.parser')
output = soup.find('div')

# Method 1 - Using strings
print("".join([ele.strip() for ele in str(output).split("\n")]))
Output:
<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>
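The same idea can be wrapped in a small helper that works on any markup string. The function name strip_lines is hypothetical, and str.splitlines() is used here in place of split("\n") so that Windows-style line endings are handled as well.

```python
def strip_lines(markup: str) -> str:
    """Strip every line of the markup and join the pieces with no separator."""
    return "".join(line.strip() for line in markup.splitlines())


# Assumed sample input: one comment on its own indented line.
sample = '<div>\n    <li><span class="Mr508">\n        Informative\n    </span></li>\n</div>'

compact = strip_lines(sample)
print(compact)
```

Because the lines are joined without a separator, text that spans several lines would be glued together without a space, so this approach suits markup where each text value fits on a single line.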
Method 2: Using regular expressions
We can also remove the whitespace in the HTML using regular expressions.
In the pattern used below, [\n] matches every newline character in the string, and [\ ]{2,} matches two or more consecutive spaces.
We can replace these matches with an empty string, thus removing the extra spaces from the document.
soup = BeautifulSoup(html, 'html.parser')
output = soup.find('div')

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"([\n])|([\ ]{2,})")
print(re.sub(pattern, '', str(output)))
Output:
<div><li><span class="Mr508">This post is so informative!</span></li><li><span class="Mr508">Informative</span></li><li><span class="Mr508">Thanks for posting</span></li></div>
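One thing to keep in mind is that a pattern matching two or more spaces also matches such runs inside the text itself. The sketch below compares the article's pattern with a narrower one that only removes whitespace touching a tag boundary. The name boundary_pattern and the sample snippet are assumptions made for illustration.

```python
import re

# Equivalent to the article's pattern: newlines, or runs of two or more spaces.
article_pattern = re.compile(r"\n| {2,}")

# Hypothetical alternative: only whitespace directly after '>' or before '<'.
boundary_pattern = re.compile(r"(?<=>)\s+|\s+(?=<)")

# Assumed sample with a double space inside the comment text.
snippet = '<li><span class="Mr508">\n    Thanks  for posting\n</span></li>'

with_article = article_pattern.sub("", snippet)
with_boundary = boundary_pattern.sub("", snippet)

print(with_article)   # the double space inside the text is removed as well
print(with_boundary)  # the text between the tags is left untouched
```

The narrower pattern has a trade-off of its own: it also removes a single space between two inline elements, which can be significant when the page is rendered, so neither pattern is a general-purpose HTML minifier.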
Conclusion
That brings us to the end of this article.
In this article, we have learned different ways of removing extra spaces from HTML when parsing using the BeautifulSoup library.