How to Create Word Clouds Using Python?

You may have already learned how to analyze quantitative data using graphs such as bar charts and histograms.

But do you know how to study textual data?

One way to analyze textual information is by using a word cloud:

Figure 0: Word cloud you’ll learn how to create in this article.

This word cloud was generated by the following code, which is discussed in the remainder of this article:

import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

path = "/Users/mohamedthoufeeq/Downloads/DisneylandReviews.csv"
df = pd.read_csv(path, encoding='ISO-8859-1')
STOPWORDS.update(['Disneyland', 'went', 'will',
                  'go', 'park', 'day', 'one'])

wordcloud = WordCloud(width = 350,
                      height = 350,
                      max_words = 1000,
                      min_font_size = 5,
                      max_font_size = 200,
                      stopwords = STOPWORDS,
                      background_color="white").generate(
                          ' '.join(df['Review_Text']))

plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In the rest of this article, I’ll show you how this code works in an easy-to-follow, step-by-step manner. Let’s get started!

There are many ways to create word clouds, but we will use the WordCloud library in this blog post. WordCloud is a Python library that makes word clouds from text.

What Are Word Clouds?

💬 Definition: A word cloud (also known as a tag cloud) is a visual representation of the words that appear most frequently in a given text. Word clouds can be used to summarize large bodies of text or to visualize the sentiment of a document.

A word cloud is a graphical representation of text data in which the size of each word is proportional to the number of times it appears in the text.
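To make the idea concrete, here is a minimal sketch that uses only the standard library. It counts the words in a short made-up text and maps each count to a font size by linear scaling. The sample text, the font_size_for helper, and the 10 to 70 size range are illustrative assumptions; the WordCloud library uses its own, more elaborate scaling and layout logic.

```python
from collections import Counter
import re

# Made-up sample text, used only for illustration.
text = "ride queue ride castle ride queue parade"

# Lowercase the text and split it into words.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)


def font_size_for(count, max_count, min_size=10, max_size=70):
    """Hypothetical helper: scale a word count linearly into a font size."""
    return min_size + (max_size - min_size) * count / max_count


max_count = max(counts.values())
sizes = {word: font_size_for(n, max_count) for word, n in counts.items()}

# Print the words from most to least frequent with their computed size.
for word, n in counts.most_common():
    print(f"{word:<8} count={n} size={sizes[word]:.0f}")
```

The most frequent word gets the largest size, and rarer words shrink in proportion to their counts.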

Word clouds can be used to quickly visualize the most prominent words in a document or to get an overview of the sentiment of a piece of text.

There are word cloud apps such as Wordle, but in this blog post, we will show how to create word clouds using the Python library WordCloud.

What’s the WordCloud Library in Python?

The WordCloud library is open source and easy to use to create word clouds in Python.

It allows you to export word clouds as image files such as PNG and, in recent versions, as SVG. PDF output is possible by saving the matplotlib figure instead.

In addition, it provides several options for customizing your word clouds, including the ability to control the font, color, and layout.

You can install it using the following command in your terminal (without the $ symbol):

$ pip install wordcloud


Where Are Word Clouds Used?

Word clouds are a fun and easy way to visualize data.

By displaying the most common words in a given text, they can provide insights into the overall themes and tone of the text.

  • Word clouds can be used for various purposes, from educational to marketing.
  • Teachers can use word clouds for vocabulary building and text analysis in the classroom.
  • You can also use word clouds to generate leads or track customer sentiment.
  • For businesses, word clouds can be used to create marketing materials, such as blog posts, infographics, and social media content.
  • Word clouds can also be used to monitor customer feedback or identify negative sentiment.
  • Students can also use word clouds to analyze a piece of text. By visually highlighting the most important words, word clouds can help students identify the main ideas and make connections between different concepts.

Pros of Word Clouds

The advantages of using word clouds are:

First, you can use them to summarize a large body of text quickly and easily. Identifying the most frequently used words in a text can provide a quick overview of the main points.

Second, with word clouds, you can quickly visualize the sentiment in a document. The size and placement of words in the word cloud can give you insights into the overall tone of the document. This is handy when analyzing a large body of text, such as customer feedback or reviews.

Third, word clouds can be a valuable tool for identifying the most important keywords in a text. By analyzing the distribution of words, you can quickly identify which terms are most prominent. This can be helpful when monitoring changing trends or assessing the relative importance of topics.

Fourth, word clouds can be used to create designs that incorporate both visual and textual elements. By blending words and images, word clouds can add another layer of meaning to an already exciting design.

How to Create Word Clouds in Python?

We will be using Disneyland reviews downloaded from Kaggle to create a word cloud data visualization. 

You can download the file from here.

In this file, we will be focussing on the Review_Text column for creating a word cloud. You can ignore other columns.

First, you have to install the WordCloud Python library. You can do this by running the following command in a terminal:

pip install wordcloud

Once you have installed WordCloud, import the pandas, wordcloud, and matplotlib.pyplot libraries.

import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

The pandas library reads the Disneyland reviews CSV file into a data frame.

We will show you the use of STOPWORDS in the upcoming section.

The data frame variable “df” stores the data from the DisneylandReviews.csv file with the following command.

df = pd.read_csv("/Users/mohamedthoufeeq/Downloads/DisneylandReviews.csv")

Now run the program and see the output.

You get the following Unicode decode error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 121844: invalid continuation byte

The Unicode decode error means that the bytes in the file could not be decoded as UTF-8, which is the encoding pandas assumes by default. This can happen when a file downloaded from Kaggle was saved in a different encoding.

To solve this problem, you need to specify the encoding format for the file. Replace the read_csv() line in your program with the following:

df = pd.read_csv("/Users/mohamedthoufeeq/Downloads/DisneylandReviews.csv", encoding='ISO-8859-1')

The encoding = 'ISO-8859-1' tells pandas that the file is in the ISO-8859-1 encoding format.
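The sketch below reproduces the same kind of error on a tiny scale using only the standard library. The word "hôtel" is a made-up example; we do not know which characters actually cause the error in the reviews file. It shows that a byte written in ISO-8859-1 can be invalid as UTF-8 and that decoding with the matching encoding works.

```python
# The word "hôtel" stored in ISO-8859-1: "ô" becomes the single byte 0xf4.
raw = "hôtel".encode("iso-8859-1")
print(raw)

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xf4 must be followed by continuation bytes, but "t" is not one.
    print("UTF-8 failed:", err.reason)

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("iso-8859-1")
print(decoded)
```

Passing encoding='ISO-8859-1' to read_csv() asks pandas to do the same kind of decoding for the whole file.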

Next, create a word cloud using the WordCloud Python library.

wordcloud = WordCloud().generate(df['Review_Text'])

In the above code, WordCloud().generate() is used to create a word cloud object.

Here, we pass the Review_Text column, which contains the reviews about Disneyland, to generate(). The words from these reviews are the ones that should appear in the word cloud.

Go ahead and run the code.

You again get an error:

TypeError: expected string or bytes-like object

The type error means that generate() expects a string or a bytes-like object, but the Review_Text column is a pandas Series.

To solve this, you have to use the following command:

wordcloud = WordCloud().generate(' '.join(df['Review_Text']))

The above command joins all the reviews in the Series into one long string, with a space between consecutive reviews.
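Here is a small sketch of why the join is needed. It assumes that WordCloud splits its input into words with a regular expression, which is what the error message suggests. A plain list of made-up reviews stands in for df['Review_Text'].

```python
import re

# Stand-in for df['Review_Text']: a few made-up reviews.
reviews = ["Loved the rides", "Long queue for every ride"]

try:
    # A regular expression search needs one string, not a collection.
    re.findall(r"\w+", reviews)
except TypeError as err:
    print("TypeError:", err)

# Joining with a space gives one long string that can be split into words.
text = " ".join(reviews)
tokens = re.findall(r"\w+", text)
print(tokens)
```

If the column contained missing values, the join itself would fail on them. One possible guard is to join df['Review_Text'].dropna().astype(str) instead.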

Next, display the word cloud with matplotlib:

plt.imshow(wordcloud)

The plt.imshow() call renders the word cloud as a 2D image.

Then remove the axis with the following command:

plt.axis("off")

The "off" parameter removes the axis from the plot.

Finally, the command below displays the image of the word cloud.

plt.show()

Once you run the program, you will see a word cloud image as shown below:

Figure 1. 

The word "park" is among the largest, which means that it appears very often in the reviews.

But words such as "Disneyland", "went", "will", "park", "go", "day", and "one" are not useful for the analysis.

So we can exclude them from the word cloud with the following command using the stopwords parameter.

STOPWORDS.update(['Disneyland', 'went', 'will', 'go', 'park', 'day', 'one'])
wordcloud = WordCloud(stopwords = STOPWORDS).generate(' '.join(df['Review_Text']))

All words in STOPWORDS are removed from the text before the word cloud is created. The updated STOPWORDS set is passed to WordCloud through the stopwords parameter.
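The filtering idea can be sketched with the standard library alone. The stop-word set and the sentence below are made up, and the lowercase comparison is an assumption about how such filtering is usually done, not a description of WordCloud's internals.

```python
import re
from collections import Counter

# Hypothetical stop-word set; the real STOPWORDS set is much larger.
stop_words = {"the", "to", "disneyland", "went", "park"}

# Made-up sentence, used only for illustration.
text = "We went to Disneyland and the park rides were great rides"

tokens = re.findall(r"[A-Za-z']+", text)
# Compare in lowercase so that "Disneyland" matches "disneyland".
kept = [t.lower() for t in tokens if t.lower() not in stop_words]

print(Counter(kept).most_common(3))
```

Note that STOPWORDS.update() changes the shared set for the rest of the session. Building a separate set, for example set(STOPWORDS) | {'park', 'day'}, and passing that to the stopwords parameter avoids this side effect.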

Now re-run the program, and you will get the following word cloud image.

Figure 2. 

Before we analyze the words, let us see how to customize the appearance of the word cloud by changing the font size and background color.

The maximum font size can be set with the max_font_size option, and the minimum font size can be set with the min_font_size option. The background color of the word cloud can be set with the background_color option.

wordcloud = WordCloud(min_font_size = 10, max_font_size = 70, 
                      stopwords = STOPWORDS, background_color="white").generate(' '.join(df['Review_Text']))

The code sets the font size to a minimum of 10 points and a maximum of 70 points, and the background color to white.

Re-run the program, and you will get the following word cloud image.

Figure 3. 

You can also set the maximum number of words to be shown using the max_words parameter.

wordcloud = WordCloud(min_font_size = 5, max_font_size = 100,
                      max_words = 1000, stopwords = STOPWORDS, background_color="white").generate(' '.join(df['Review_Text']))

The above code sets the maximum number of words shown in the word cloud to 1000. It also changes the minimum and maximum font sizes to 5 and 100.

Re-run the program, and you will get the following word cloud.

Figure 4. 

As you can see, when you increase the number of words to 1000, the words that are repeated more in the reviews are shown in a larger size.

This makes it easier to find out which words are prominent. In this word cloud, you can see that "ride" is the largest word.

You can also set the width and height of the word cloud image.

wordcloud = WordCloud(width=350, height=350, min_font_size=5,
                      max_font_size=100, max_words=1000,
                      stopwords=STOPWORDS, background_color="white").generate(' '.join(df['Review_Text']))

The above code sets both the width and the height of the word cloud to 350 pixels.

Re-run the program, and you will get the following word cloud image.

Figure 5. 

Now let’s analyze the word cloud to get some insights.

The word "ride" appears largest in the word cloud because it is the most frequent word in the text. This suggests that the rides are what most visitors enjoy and write about.

Next, the word "attraction" is also popular. It shows that people are attracted to the rides and attractions in Disneyland. 

Also, the word "time" appears frequently. The word indicates that people spend a lot of time in Disneyland. 

The word "nice" also appears frequently, which suggests that reviewers found the Disneyland staff friendly.

The reviews also mention long queues and long waiting times: the words "lines" and "queue" are prominent in the word cloud as well.

By contrast, the word "hotel" is not prominent, which may indicate that most people do not stay in a hotel and instead go home after spending the whole day in Disneyland.

💬 Exercise: You can get more insights by analyzing the word cloud data. Try it out!

Summary

Word clouds are a great way to summarize and understand large bodies of text or to visualize a document’s sentiment, and they can be used for various purposes.

This blog post showed how to create word clouds using the Python library WordCloud.

We also discussed how to customize the appearance of the word cloud and analyzed the word cloud data to get insights into the text.

What do you use?