Following on from our introduction to Newspaper3k, we can now take that basic knowledge and explore what else this library lets us do.
Here I am going to walk through a project that takes articles from a set of different news agencies, picks out their most used words, and displays the results as a word cloud with the help of NLP and Matplotlib.
You can check out the full code on GitHub here.
Let’s get started.
Introduction
In this article, we are going to scrape a series of articles from several different news sources and use Newspaper3k to extract the keywords from each one. From those keywords we can then create a word cloud that displays the most important topics of the day.
Word clouds may not be the most penetrating way to analyze text data, but they can be a very engaging and simple means of discovering words or common word patterns that appear frequently. For example, if you are able to get the text of speeches or writings of a public figure, you can easily visualize the most important topics they cover with a word cloud. To take it further, companies could combine this with sentiment analysis to find out which of their products are written about the most and how positively or negatively they are viewed.
For example, here is a word cloud from ‘Laudato Si’, a Vatican encyclical put out 6 years ago. The document is about 250 pages; however, we can very quickly get the gist of what the encyclical is about by looking at the 100 most-used words in the paper:
Depending on the text we are analyzing, we can maybe even determine the basic theme or arguments of the paper just from looking at a word cloud. As we can see from the word cloud of this paper, we can guess that the encyclical concerns matters of the planet and humanity, that there is some sort of problem and something must be done to help the planet, maybe for the good of ‘us’ or humanity, maybe for God as well. As we see, creating a word cloud with the help of Newspaper3k and data analysis can give us a lot of information about a text in a single picture.
Now that we see the possibilities here, let’s begin to make our own word clouds.
Scrape a Set of Articles From Different News Sources
Given all of the news recently about the American troop withdrawal from Afghanistan, we will focus on news about the United States for this project. We will collect the RSS feeds from the following news sources: ABC News, NBC News, CBS News, RT News, The Guardian, and the New York Times. The links to the feeds are in the code, so you do not have to search for them.
As mentioned in our previous article, RSS feeds allow us to scrape article links quickly and easily, especially for today’s news. If this is your first introduction to web scraping or Newspaper3k, I encourage you to read it so you understand how RSS feeds work, how to scrape them, and the Python libraries you may need to download before you begin. The video also shows you how to set up a virtual environment in the folder from which you will run your code.
Let’s begin.
First we will import some necessary libraries and collect all of our feeds and put them into a list:
import requests
from bs4 import BeautifulSoup
from newspaper import Article
import csv

feeds = [
    'https://www.nbcnews.com/rss',
    'https://www.theguardian.com/us/rss',
    'https://www.rt.com/rss/usa/',
    'https://abcnews.go.com/abcnews/usheadlines',
    'https://www.cbsnews.com/latest/rss/us',
    'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
]

Now let’s scrape for the articles. Luckily for us, all of the RSS feeds here can be scraped in exactly the same way.

articles = []
for feed in feeds:
    response = requests.get(feed)
    webpage = response.content
    soup = BeautifulSoup(webpage, features='xml')
    # every article link will be found in an item tag
    items = soup.find_all('item')
    # extract the link
    for item in items:
        link = item.find('link').text
        articles.append(link)
In an RSS feed, every article link is included in its own separate <item> tag. We can simply look for every instance of the <item> tag and collect the link in the <link> tag inside.
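To see the <item> and <link> idea in isolation, here is a minimal sketch that runs without any network access. It is not the code used in this project: it swaps BeautifulSoup for Python's built-in xml.etree.ElementTree, and the feed text, the example.com URLs, and the function name extract_links are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS document standing in for a real feed response.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First story</title>
      <link>https://example.com/first-story</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second-story</link>
    </item>
  </channel>
</rss>"""


def extract_links(rss_text):
    """Return the text of the <link> tag inside every <item> tag."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.find("link")
        # skip items that have no usable <link>
        if link is not None and link.text:
            links.append(link.text.strip())
    return links


sample_links = extract_links(SAMPLE_RSS)
print(sample_links)
```

Notice that the channel's own <link> (the feed's home page) is not collected, because we only look inside <item> tags.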
Now that we have our list of articles, it’s time to scrape each article using the Newspaper3k library. Then we will store the data in a CSV file. For this article, we will save the URL, the article keywords, and the text (in case we would like to do further analysis of the text).
data = []
for url in articles:
    info = Article(url)
    info.download()
    info.parse()
    info.nlp()
    keywords = info.keywords
    text = info.text
    # save the URL, the keywords, and the text
    save = [url, keywords, text]
    data.append(save)

# newline='' keeps the csv module from writing blank rows on Windows
with open('MyCSV.csv', 'w', newline='', encoding='utf-8') as csv_file:
    # set the column labels for the CSV file
    label = ['URL', 'Keywords', 'Text']
    # write the data into the CSV file
    writer = csv.writer(csv_file)
    writer.writerow(label)
    writer.writerows(data)
# the with block closes the file for us, so no explicit close() is needed
Display a Word Cloud From the Data
Now that we have all of our data stored, the fun can begin and we can create our word clouds.
Before we develop our big word cloud from all of the articles, I will show you how to quickly create a word cloud from just one article.
First, if you do not have Jupyter Notebook on your computer, either of the following commands will install it:
conda install -c conda-forge notebook
or
pip install notebook
Then the following command will open up Jupyter notebooks:
jupyter notebook
Jupyter will then open up in your browser and you can begin work.
Creating Simple Word Clouds
Let’s first install the necessary libraries. We will need pandas to read the CSV data, the wordcloud library to create our word cloud image, and Matplotlib to display the word cloud. If you need to install the three libraries, this will suffice:
pip install pandas
pip install matplotlib
pip install wordcloud
Now the code:
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

Now what we need to do is load the data from the CSV file we just saved and create a data frame. By default, pandas treats the first row as the column titles rather than as data, so we do not need to skip it ourselves.

# Load in the dataframe (the file saved in the previous step)
# converters={'Keywords': eval} → to convert each row of keywords back into a list
df = pd.read_csv('MyCSV.csv', converters={'Keywords': eval})
# if you want to see how many rows of data you have
print(df.shape[0])
# if you wish to display the data
df.head()

Here is what the first couple of rows of my data look like:
Setting the converters as we did for the ‘Keywords’ column is necessary to convert each row of keywords back into a list that we originally saved into the CSV file. Without this, each row of keywords will simply be one big string that will be a pain to deal with.
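If you are curious why the converter is needed, this small sketch shows what happens to a list on its way through a CSV file. It uses only the standard library and an in-memory file, the example row is made up, and it uses ast.literal_eval in place of eval, which is my own substitution: it does the same job for plain lists but refuses to run arbitrary code.

```python
import ast
import csv
import io

# One made-up row in the same shape as our data: URL, keywords, text.
rows = [["https://example.com/a", ["water", "lake", "shortage"], "Some text."]]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["URL", "Keywords", "Text"])
writer.writerows(rows)

# Read the row back the way a CSV reader sees it.
buffer.seek(0)
record = next(csv.DictReader(buffer))

raw_keywords = record["Keywords"]                 # a string that only looks like a list
parsed_keywords = ast.literal_eval(raw_keywords)  # a real Python list again

print(type(raw_keywords).__name__, raw_keywords)
print(type(parsed_keywords).__name__, parsed_keywords)
```

The same function should also work as the converter in pandas, for example converters={'Keywords': ast.literal_eval}, if you would rather avoid eval.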
So that you know how it works, we will first make a word cloud from a single article. Since a single article only has a handful of keywords, we will build this word cloud from the text of the article instead.
We start from the set of ‘stopwords’ that ships with the wordcloud library, which removes useless words like ‘the’, ‘we’, etc. that would add clutter to the word cloud. If you wish to add more stopwords, you can type stopwords.add("your word").
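To make the idea of stopwords concrete, here is a rough sketch of the filtering step on its own. This is not how the wordcloud library is implemented internally; it is just a simplified illustration, and the short DEMO_STOPWORDS set, the sample sentence, and the function name count_words are made up for the example.

```python
import re
from collections import Counter

# A tiny hand-picked stopword set for illustration only;
# the STOPWORDS set in the wordcloud library is much larger.
DEMO_STOPWORDS = {"the", "we", "is", "a", "of", "in", "and", "to"}


def count_words(text, stopwords):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample = "The lake is low and the water in the lake is a worry to the West"
counts = count_words(sample, DEMO_STOPWORDS)
print(counts)

# Adding a stopword works the same way as stopwords.add("your word"):
more_stopwords = set(DEMO_STOPWORDS)
more_stopwords.add("west")
print(count_words(sample, more_stopwords))
```

The words that survive the filter, weighted by how often they appear, are what end up drawn in the cloud.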
# Start with one article:
text = df.Text[177]
# text = " ".join(i for i in keywords)
print(df.URL[177])
# Output: https://www.rt.com/usa/532248-lake-mead-water-shortage/?utm_source=rss&utm_medium=rss&utm_campaign=RSS

# stopwords is simply a set of words to be eliminated
# if no stopwords are passed to WordCloud, the default STOPWORDS set is used
stopwords = set(STOPWORDS)
# if you want to add to the stopwords list, here I add some news sources
stopwords.update(['s', 't', 'rt', 'co', 'abc', 'nbc', 'cbs', 'nytimes'])

# Create and generate a basic word cloud image:
wordcloud = WordCloud(stopwords=stopwords, max_words=50, background_color="white").generate(text)

# Display the generated image:
plt.figure(figsize=[10, 10])
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Here is what my word cloud looks like from this article:
As we can see, this article is about some water shortage in either Nevada or Arizona.
Now that we know how to create a basic word cloud from a single article, let’s expand that to find the key topics of the day from a large set of articles.
It will not be necessary to run our analysis from the text of each article (although you can if you wish), so we will just create a word cloud from all of the keywords extracted.
First we need to gather all of the keywords, put them into a single list, and then get them all out of the list and formatted as a big text. With a little ‘python-fu’ it can easily be done:
# for row (i) in df.Keywords
#     for keyword (j) in i
#         append j
keywords = [j for i in df.Keywords for j in i]
# join all of the keywords into a single string
text = " ".join(keywords)
# print(text)
Now we have a full text of all of our keywords ready to be made into a word cloud.
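If the nested list comprehension looks cryptic, this small standalone sketch shows the same flatten-and-join step on a made-up list of keyword lists (the keyword_rows data is invented for illustration and stands in for the Keywords column).

```python
# Made-up stand-in for df.Keywords: one list of keywords per article.
keyword_rows = [
    ["afghanistan", "troops", "kabul"],
    ["covid", "vaccine"],
    ["afghanistan", "biden"],
]

# Flatten: take every keyword out of every row, in order.
flat_keywords = [keyword for row in keyword_rows for keyword in row]

# Join: one long string with the keywords separated by spaces.
joined_text = " ".join(flat_keywords)

print(flat_keywords)
print(joined_text)
```

Repeated keywords such as 'afghanistan' are kept on purpose, since the repeats are what make a word appear larger in the cloud.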
# collocations=False so that duplicate words do not appear as part of a larger phrase (like 'president Biden')
wordcloud = WordCloud(stopwords=stopwords, max_words=100, background_color="white", collocations=False).generate(text)

# Display the generated image:
plt.figure(figsize=[10, 10])
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

As we can see, the most used keywords relate to the most important topics in the United States today: Afghanistan and COVID-19. Events in Afghanistan in particular are dominating the headlines.
Getting a Little Creative With Our Word Cloud
For kicks, I will show you how you can have a little fun creating your word cloud. It does not have to look like the one above.
Since we are dealing with news from the United States, why don’t we create a word cloud in the shape and color of the American flag?
First, find a picture of the flag and then save it as a PNG file. Here is the link to the image I used:
Now we will load it into our program and create a word cloud that looks like the American flag.
You will need to pip install numpy and Pillow if you do not have them.
Now our code:
import numpy as np
from PIL import Image
from wordcloud import ImageColorGenerator

# Generate a word cloud image
frame = np.array(Image.open("usa.png"))
wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=500, collocations=False, mask=frame).generate(text)

# create coloring from image
image_colors = ImageColorGenerator(frame)
plt.figure(figsize=[15, 15])
# add the color scheme to the word cloud
plt.imshow(wordcloud.recolor(color_func=image_colors))
plt.axis("off")
plt.show()
We take the image and turn it into a mask, or a frame: an array of numerical values that maps to the image we put in and directs where the words can go. That mask is then passed into the WordCloud object. Finally, ImageColorGenerator reads the colors from the same array, and we use them to recolor our new word cloud.
In this case I added a few hundred extra words to fill in the flag a little more so you can see how the program fits the words to the shape and contours of the flag.
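To give a feel for what the mask does, here is a deliberately simplified sketch in plain Python. It is not the wordcloud library's code: the real mask is a NumPy array and the real ImageColorGenerator works on the region each word covers, while this toy version uses nested lists and single pixels. It relies on the documented rule that pure white entries in the mask are treated as 'masked out', and the 3x4 toy_image, the color values, and the function names are all made up for illustration.

```python
WHITE = (255, 255, 255)
RED = (178, 34, 52)
BLUE = (60, 59, 110)

# A made-up 3x4 "image": each inner list is one row of (R, G, B) pixels.
toy_image = [
    [BLUE, BLUE, RED, RED],
    [BLUE, BLUE, WHITE, WHITE],
    [RED, RED, RED, RED],
]


def drawable_cells(image):
    """Mark each pixel True if words may be drawn there (anything but pure white)."""
    return [[pixel != WHITE for pixel in row] for row in image]


def color_at(image, row, col):
    """Look up the image color that a word placed at (row, col) would borrow."""
    return image[row][col]


allowed = drawable_cells(toy_image)
for row in allowed:
    print(row)
print(color_at(toy_image, 0, 0))
```

This is also why the white stripes of the flag stay mostly empty in the final picture: white areas of the image are where words are not placed.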
Here is our new ‘American’ word cloud now:
Enjoy!
If you want the code and the CSV I used, I have included them on my GitHub.