In this final installment of my series on Newspaper3k, we will see the real possibilities of what we can do after scraping massive amounts of news articles. To demonstrate, we will collect data from 3 popular American news websites for a full year, from September 2020 to August 2021. We will analyze articles about current US President Joe Biden and find out what we can learn from the charts.
We will do some basic web scraping with the help of BeautifulSoup and Newspaper3k. Then we will use TextBlob to do some sentiment analysis, store the data in a CSV file, and finally plot the data using pandas DataFrames and Seaborn.
I will also show you a cool trick that can allow you to get a massive amount of articles if you know precisely what you need to look for. Hint: Google doesn’t put request limits on ALL pages on their site. 🙂
I have all of my code (BidenProject.ipynb) and a CSV file (combined.csv) of the article data on my Github page. Otherwise, let’s jump right in.
Video: As you go through the article, you can also watch my explainer video presenting the code snippets and more introduced here—in an easy-to-follow, step-by-step manner:
A Quick Word About Google News RSS Feeds
I’m personally a big fan of scraping RSS feeds because they allow me to continually scrape up-to-date news. In our case, however, we are looking to scrape much older data, and finding archived online news articles can be a bit of a hassle. In this case, we are lucky enough to have Google News help us.
In case you did not know, you can receive the results of a Google News search as an RSS feed by taking the search URL and replacing ‘news.google.com/’ with ‘news.google.com/rss/’. Depending on several factors, you can pull news articles that are more than 10 years old. For example, a search for Barack Obama on CNN.com from 2008 returns about 60 hits, assuming all of these articles actually mention Obama.
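As a minimal sketch of how such a feed URL can be assembled, the helper below builds a search URL using only the standard library. The function name google_news_rss_url is hypothetical, and the query parameters (q, ceid, hl, gl) and the after:, before:, and site: operators are simply the ones used in the scraping code later in this article, not an official specification.

```python
from urllib.parse import urlencode

def google_news_rss_url(query, site, after, before):
    """Build a Google News RSS search URL for one site and one date window.

    Hypothetical helper: the parameter names mirror the URL used in this article.
    """
    # The search operators (after:, before:, site:) all live inside the q parameter
    q = f"{query} after:{after} before:{before} site:{site}"
    params = {"q": q, "ceid": "US:en", "hl": "en-US", "gl": "US"}
    # urlencode turns spaces into '+' and percent-encodes the colons
    return "https://news.google.com/rss/search?" + urlencode(params)

url = google_news_rss_url("joe biden", "cnn.com", "2020-09-01", "2020-09-30")
print(url)
```

The colons come out percent-encoded (%3A), which is an equivalent way of writing the same query string.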
Even though Google limits access to most of its pages, it seems to allow its RSS feeds to be accessed freely. In other words, there appear to be no request limits on the feeds, so you can scrape them as much as you want. We can now use Google to do the heavy lifting and find articles for us. All we need to do is scrape the links and code the data analysis.
Unfortunately, each Google News RSS feed will give you a maximum of 100 articles. Therefore, if you know your search is going to have over 100 results, you may need to fiddle around with the parameters and adjust the number of RSS feeds to scrape. There is a good resource here as well if you are interested in learning more about Google News RSS feeds.
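One possible way to work around the 100-article cap is to request several shorter date windows instead of one long one. The sketch below is only an illustration: split_window is a hypothetical helper, and I am assuming (not confirming) that narrower after:/before: windows return different subsets of the results.

```python
from datetime import date, timedelta

def split_window(start, end, days=7):
    """Split the range [start, end] into consecutive sub-windows of at most `days` days."""
    windows = []
    current = start
    while current <= end:
        # clamp the last window so it never runs past the end date
        window_end = min(current + timedelta(days=days - 1), end)
        windows.append((current.isoformat(), window_end.isoformat()))
        current = window_end + timedelta(days=1)
    return windows

weekly = split_window(date(2020, 9, 1), date(2020, 9, 30))
for start, end in weekly:
    print(start, end)
```

Since it is not clear whether the date operators are inclusive, it is probably worth de-duplicating the collected links (for example with a set) before scraping them.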
Scraping the Articles with Newspaper3k and Producing Sentiment Scores with TextBlob
As mentioned above, we are going to scrape articles about President Biden from the following news services: ABC News, CNN, and Fox News. We will collect 100 articles per month from September 2020 to August 2021, do a sentiment analysis of each article, save the data, and then plot monthly data and do some basic analysis.
Before we begin, if this is your first introduction to Newspaper3k and text sentiment analysis, I have written articles here and here introducing the subject. Therefore, I will not go over the scraping part in great detail. Instead, I will show you the code and briefly review it. You can type it out in either your preferred code editor or a Jupyter Notebook; I used Jupyter for all of the code in this exercise.
First, we import the necessary libraries:
import requests
from newspaper import Article, ArticleException
from bs4 import BeautifulSoup
from dateutil.rrule import *
from datetime import *
from textblob import TextBlob
import nltk
import csv

nltk.download('punkt')
Now, we need to set our timeframe.
start_dates = [datetime.strftime(dt, "%Y-%m-%d")
               for dt in rrule(MONTHLY, dtstart=datetime(2020, 9, 1), until=datetime(2021, 8, 1))]
end_dates = [datetime.strftime(dt, "%Y-%m-%d")
             for dt in rrule(MONTHLY, dtstart=datetime(2020, 9, 30), bymonthday=(31, -1),
                             bysetpos=1, until=datetime(2021, 8, 31))]
dates_list = list(zip(start_dates, end_dates))
I’m using the dateutil library to set our dates for us. dateutil.rrule allows us to efficiently set our monthly intervals and get the correct number of days per month. Using list comprehensions, we build a list of start dates and a list of end dates for every month. Then we use zip() to pair the appropriate start and end dates together into tuples, and put those pairs into one big list. These will be the search terms for the start and end dates, and we need them to be strings in the 'YYYY-MM-DD' format shown.
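If you want to double-check what the list of date pairs should look like, here is a small cross-check that uses only the standard library. It is not the approach used in this article (which relies on dateutil); month_windows is a hypothetical helper that gets the last day of each month from calendar.monthrange.

```python
import calendar

def month_windows(start_year, start_month, count):
    """Return `count` (first_day, last_day) string pairs, one per consecutive month."""
    pairs = []
    year, month = start_year, start_month
    for _ in range(count):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        pairs.append((f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day:02d}"))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pairs

windows = month_windows(2020, 9, 12)
print(windows[0])   # first month in the range
print(windows[-1])  # last month in the range
```

Each tuple has the same shape as the pairs produced by zip() above, so the two lists can be compared directly.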
I will now show you the function which takes those dates as a parameter and then will scrape the articles and perform the sentiment analysis. This section will take several hours to run.
def get_articles(dates):
    news_sites = ['foxnews.com', 'cnn.com', 'abcnews.go.com']
    data = []
    for site in news_sites:
        # loop through each set of dates you wish to input
        for date1, date2 in dates:
            articles = []
            # URL of the Google News RSS with Joe Biden in the search parameter
            main_url = ("https://news.google.com/rss/search?q=joe+biden+after:" + date1
                        + "+before:" + date2 + "+site:" + site + "&ceid=US:en&hl=en-US&gl=US")
            # get the results in the RSS and collect the links
            response = requests.get(main_url)
            webpage = response.content
            soup = BeautifulSoup(webpage, features="xml")
            items = soup.find_all('item')
            for item in items:
                link = item.find('link').text
                articles.append(link)
            # parse the articles and get the polarity and subjectivity scores for each in your
            # specified time frame
            for url in articles:
                # print(url)
                # We throw this in a try/except block in case we get a bad link which would kill the program
                try:
                    # Newspaper3k to scrape the article text and title
                    article = Article(url)
                    article.download()
                    article.parse()
                    article.nlp()
                    title = article.title
                    text = article.text
                    # to make sure that the article does include Joe Biden
                    # depending on your criteria, Google will often give you completely irrelevant results
                    if 'Biden' in text:
                        # run sentiment analysis on the article text
                        # create a Textblob object and then get the sentiment values and store them
                        text_blob = TextBlob(text)
                        polarity = text_blob.polarity
                        subjectivity = text_blob.subjectivity
                        # in case we get a non-article as a link - do not include in data
                        if polarity == 0 and subjectivity == 0:
                            pass
                        else:
                            # Save the necessary data to then put in a csv file
                            save = [site, url, title,
                                    datetime.strptime(date2, '%Y-%m-%d').strftime('%Y-%m'),
                                    polarity, subjectivity]
                            data.append(save)
                # If there is a bad link, move on to the next one
                except ArticleException:
                    pass
    return data

# collect the article data with our specified dates
articles = get_articles(dates_list)

# write data to csv (newline='' avoids blank rows on Windows; the with block closes the file for us)
with open('combined.csv', 'w', newline='', encoding='utf-8') as csv_file:
    header = ['News Source', 'URL', 'Title', 'Month', 'Polarity', 'Subjectivity']
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(articles)
After passing the dates into get_articles(), we collect an RSS feed for every month and every news source by inserting each start date, end date, and site domain as strings into the Google RSS URL. Then we store all of the article links from the feed in a list, scrape each article and run sentiment analysis on it, and finally save the data into a CSV file for later use.
It is necessary to put the scraping code in a try/except block in case there happens to be a bad article link in your results which would kill the program. The exception given will be an ‘ArticleException’ error when running Newspaper3k, so use that in your ‘except’ block and make sure to include that in the imports at the top.
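If you would like to try the link-collection step without making any network requests, the sketch below parses a tiny, made-up RSS snippet with the standard library’s xml.etree.ElementTree instead of BeautifulSoup. The sample feed and the extract_links helper are hypothetical and only mimic the item/link structure that the scraping code relies on.

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration only; real Google News feeds carry more fields
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>sample feed</title>
    <item><title>First story</title><link>https://example.com/a</link></item>
    <item><title>Second story</title><link>https://example.com/b</link></item>
    <item><title>No link here</title></item>
  </channel>
</rss>"""

def extract_links(feed_xml):
    """Return the text of every <link> found inside an <item> element."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        # skip items without a link instead of failing on them
        if link:
            links.append(link.strip())
    return links

links = extract_links(SAMPLE_FEED)
print(links)
```

Skipping items that have no link follows the same idea as the try/except block: one bad entry should not stop a run that takes hours.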
Now that we have our data stored and ready to be analyzed, we can go ahead and plot the data.
Plotting The Data
I am going to show some simple plots you can make with Seaborn, an excellent tool for making nice-looking graphs and plots. The documentation is here, and there are several good articles and videos on finxter.com if you want to see some more advanced implementations, like this article making heatmaps of COVID data.
First, we need to put the data into a pandas DataFrame, normalize the dates in the Month column by parsing them as datetime objects and formatting them back into ‘YYYY-MM’ strings (which sort correctly in time), and then sort the data by both month and news source (although sorting by news source is not entirely necessary). I use ‘inplace=True’ so that the DataFrame itself is sorted, rather than a sorted copy being returned, and stays sorted throughout the entire kernel.
import pandas as pd
import seaborn as sns

# set the dataframe, normalize the Month column to "YYYY-MM" strings, and then sort so the data is plotted correctly
df = pd.read_csv('combined.csv')
df['Month'] = pd.to_datetime(df['Month']).dt.strftime("%Y-%m")
df.sort_values(by=['Month', 'News Source'], inplace=True)
Now we can plot our data.
Since the reason we plot graphs is to answer certain questions that we have about some data, let’s ask a few questions ourselves and see what the results give us. First: are ABC, CNN, and Fox News in general favorably inclined, neutral, or maybe negatively inclined towards the President? Have they been consistent in their reporting this past year? If not, did something happen that might have led to opinion changing in either direction?
Let’s plot a simple line graph over time of the average polarity per month for ABC, CNN, and Fox News.
# line chart of the average polarity per month for each source
# sns.set(rc={'figure.figsize':(15,10)})
sns.lineplot(x = 'Month', y = 'Polarity', hue='News Source', ci=None, data = df)
We can also use bar graphs if we want to visualize the data differently.
# bar chart of the average polarity per month for each source
sns.barplot(y = 'Polarity', x = 'Month', hue='News Source', ci=None, data = df)
By setting a hue, we allow the news sources to be grouped and analyzed separately. We set ‘ci=None’ so that the confidence interval is not shown; otherwise the graphs would look messy.
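Because each month has many articles per source, Seaborn’s line and bar plots collapse them into one value per month, which by default is the mean. The sketch below reproduces that aggregation with the standard library on a few made-up rows; the numbers are purely illustrative and are not taken from combined.csv, and monthly_means is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Made-up (month, source, polarity) rows for illustration only
rows = [
    ("2020-09", "cnn.com", 0.10),
    ("2020-09", "cnn.com", 0.20),
    ("2020-09", "foxnews.com", 0.05),
    ("2020-10", "cnn.com", 0.12),
    ("2020-10", "foxnews.com", 0.02),
    ("2020-10", "foxnews.com", 0.08),
]

def monthly_means(records):
    """Average the polarity for every (month, source) pair."""
    groups = defaultdict(list)
    for month, source, polarity in records:
        groups[(month, source)].append(polarity)
    # sorting the keys keeps months in chronological order ("YYYY-MM" sorts as text)
    return {key: mean(values) for key, values in sorted(groups.items())}

averages = monthly_means(rows)
for (month, source), value in averages.items():
    print(month, source, round(value, 3))
```

Each printed value corresponds to one point on the line chart (or one bar) for that source and month.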
Polarity measures how positively or negatively inclined a text is. With a range from -1 (extremely negative) to 1 (extremely positive), the polarity can potentially tell us whether or not the text is favorably inclined towards its selected topic. Although the polarity can fall anywhere between -1 and 1, the range in our data is not very broad: most articles fall between -0.2 and 0.3. In my experience studying polarity data, even news outlets with a stronger political bent have their articles generally fall within this range. There is a good article here if you are interested in learning more about how TextBlob produces its sentiment scores.
With only these graphs, our questions are mostly answered. If you are an American, you would expect Fox News to have a less favorable opinion of a Democrat President and you would expect the opposite for CNN. ABC, in general, has managed to stay somewhat in the middle.
From September to November of last year the United States had their elections, so it would be natural for CNN, and even ABC, to throw more of their weight behind a candidate they would much rather have than ex-President Trump. On the other hand, in August of this year, the USA had its disastrous troop pullout from Afghanistan which many Americans saw as an embarrassment to the country. This is certainly reflected in the low polarity score for the month from Fox News. Even the other news outlets did not seem to like what happened.
If you are curious about the distribution of polarity values for each news outlet, we can make a quick histogram for August. We can isolate all the data from August using ‘get_group()’ after grouping our entire DataFrame by month. Then we can make our plot with ‘sns.histplot’, setting x to ‘Polarity’, the hue to ‘News Source’ so that each outlet gets its own color-coded distribution, and kde (kernel density estimate) to ‘True’ to draw a smoothed curve over each distribution.
# group by a plain column name so get_group() accepts a plain string key
aug = df.groupby('Month').get_group('2021-08')
# aug.head()
sns.histplot(data=aug, hue='News Source', x='Polarity', kde=True)
Here is our data:
Since we selected a fairly balanced sampling of news sources, we can also draw the tentative conclusion that the first year of Biden’s presidency was not particularly well received, given that there is a general downward polarity trend across all our sources.
On the other hand, the subjectivity does not give away too much other than that maybe the charged atmosphere of the elections calmed down the reporting a little bit.
# line chart of the average subjectivity per month for each source
sns.lineplot(x = 'Month', y = 'Subjectivity', hue='News Source', ci=None, data = df)
Subjectivity scores of news articles usually hover between 0.2 and 0.6, so there is nothing out of the ordinary here, although you often see much higher scores from more politically extreme news sites.
Let’s now ask one more question before I end the article. ABC tends to have a reputation of being a very balanced news source, although leaning just a little to the left politically. In what was a very contentious political climate in the USA the past year, did the integrity of their reporting on Joe Biden hold up to their perceived reputation?
Box plots might be our best choice here because the line plots we used before could not neatly show the distribution of article scores; we only saw the average. All we need to do is group the DataFrame by outlet and then use get_group() to isolate the data from ABC. To make our chart look more colorful we can add a color palette; in this case, I use ‘mako’.
abc = df.groupby('News Source').get_group('abcnews.go.com')
# run each boxplot in its own notebook cell so the two charts do not overlap
sns.boxplot(x = 'Month', y = 'Polarity', data = abc, palette='mako')
sns.boxplot(x = 'Month', y = 'Subjectivity', data = abc, palette='mako')
It seems as if we can say the data confirms the opinion. Even the events in August did not have too much of an effect on the reporting of Joe Biden. If we wish, we can compare the data to Fox News on a single chart. We can concatenate together two individual DataFrames from each news outlet and chart the data.
# compare ABC and Fox polarity data side-by-side
fox = df.groupby('News Source').get_group('foxnews.com')
abc = df.groupby('News Source').get_group('abcnews.go.com')
a_f = pd.concat([abc, fox])
sns.boxplot(x = 'Month', y = 'Polarity', hue='News Source', data = a_f)
Since the Fox News data seems to be more varied, we can say that, compared to Fox News, ABC is more balanced, at least in its reporting about the current President.
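One way to put a number on “more varied” is the interquartile range (IQR), which is the height of the box in a box plot. The sketch below computes it with the standard library on two made-up lists of polarity scores; the values are illustrative only and are not taken from the ABC or Fox News data, and iqr is a hypothetical helper.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: distance between the first and third quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1

# Made-up polarity scores for two hypothetical outlets
narrow = [0.05, 0.08, 0.10, 0.11, 0.12, 0.15]
wide = [-0.20, -0.05, 0.08, 0.12, 0.25, 0.30]

print("narrow IQR:", round(iqr(narrow), 4))
print("wide IQR:", round(iqr(wide), 4))
```

Quartile conventions differ slightly between libraries, so the exact numbers may not match the edges of the boxes Seaborn draws, but the comparison between outlets should point the same way.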
Given that arguably the 3 most popular news outlets in the USA are increasingly negative on the topic of Joe Biden, we may be able to conclude that the President may need to do something to improve his popularity nationwide. However, a more in-depth analysis is required.
This concludes our series on news article scraping and analysis with the help of Newspaper3k, a tool that allows you to scrape massive amounts of news article data with only a couple of lines of code.
I have also included more graphs and code in my video above or on my Github.