<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Newspaper3k Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/category/newspaper3k/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/category/newspaper3k/</link>
	<description></description>
	<lastBuildDate>Sat, 11 Sep 2021 16:15:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Newspaper3k Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/category/newspaper3k/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Analyzing News Articles with Newspaper3k, TextBlob, and Seaborn</title>
		<link>https://blog.finxter.com/analyzing-news-articles-with-newspaper3k-textblob-and-seaborn/</link>
		
		<dc:creator><![CDATA[Craig Helstowski]]></dc:creator>
		<pubDate>Sat, 11 Sep 2021 12:42:59 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Newspaper3k]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=35052</guid>

					<description><![CDATA[<p>In this final installment of my series on Newspaper3k, we will see the real possibilities of what we can do after scraping massive amounts of news articles.  To demonstrate, data will be collected from 3 popular American news websites for a full year, from September 2020 to August 2021.  We will analyze articles ... <a title="Analyzing News Articles with Newspaper3k, TextBlob, and Seaborn" class="read-more" href="https://blog.finxter.com/analyzing-news-articles-with-newspaper3k-textblob-and-seaborn/" aria-label="Read more about Analyzing News Articles with Newspaper3k, TextBlob, and Seaborn">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/analyzing-news-articles-with-newspaper3k-textblob-and-seaborn/">Analyzing News Articles with Newspaper3k, TextBlob, and Seaborn</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this final installment of my series on <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" target="_blank" rel="noreferrer noopener">Newspaper3k</a>, we will see the real possibilities of what we can do after scraping massive amounts of news articles.  To demonstrate, data will be collected from 3 popular American news websites for a full year, from September 2020 to August 2021.  We will analyze articles about current US President Joe Biden and find out what we can learn from the charts.</p>



<p>We will do some basic web scraping with the help of <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank" rel="noreferrer noopener">BeautifulSoup</a> and <a href="https://blog.finxter.com/category/newspaper3k/" target="_blank" rel="noreferrer noopener" title="https://blog.finxter.com/category/newspaper3k/">Newspaper3k</a>.  Then we will use <code>TextBlob</code> to do some sentiment analysis, store the data in a <a href="https://blog.finxter.com/pandas-to_csv/" target="_blank" rel="noreferrer noopener" title="Pandas to_csv()">CSV</a> file, and finally plot the data using pandas <em><a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" target="_blank" rel="noreferrer noopener" title="How to Create a DataFrame in Pandas?">DataFrames</a></em> and <em><a href="https://blog.finxter.com/heatmaps-with-seaborn/" target="_blank" rel="noreferrer noopener" title="Creating Beautiful Heatmaps with Seaborn">Seaborn</a></em>.</p>



<p>I will also show you a cool trick that can get you a massive number of articles if you know precisely what you need to look for.&nbsp; Hint: Google doesn’t put request limits on ALL pages of their site. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>I have all of my code (<code>BidenProject.ipynb</code>) and a CSV file (<code>combined.csv</code>) of the article data on <a href="https://github.com/finxter/Articles" target="_blank" rel="noreferrer noopener" title="https://github.com/finxter/Articles">my Github page</a>.  Otherwise, let’s jump right in.</p>



<p><strong>Video</strong>: As you go through the article, you can also watch my explainer video presenting the code snippets and more introduced here—in an easy-to-follow, step-by-step manner:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Analyzing News Articles with Newspaper3k, TextBlob, and Seaborn" width="937" height="527" src="https://www.youtube.com/embed/OH5bW5g-CwQ?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">A Quick Word About Google News RSS Feeds</h2>



<p>I’m personally a big fan of scraping RSS feeds because they allow me to continually scrape up-to-date news.&nbsp; In our case, however, we are looking to scrape much older data, and finding archived online news articles can be a bit of a hassle.&nbsp; In this case, we are lucky enough to have Google News help us.</p>



<p>In case you didn’t know: when you run a Google News search, you can receive the results as an RSS feed by taking the URL and replacing <em>‘news.google.com/’</em> with <em>‘news.google.com/rss/’</em>.  Depending on several factors, you can pull news articles more than 10 years old.  For example, a search for Barack Obama on <a href="https://cnn.com" target="_blank" rel="noreferrer noopener" title="https://cnn.com">CNN.com</a> restricted to 2008 returns about 60 hits, each of which should mention Obama.</p>
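<p>To make the swap concrete, here is a tiny sketch (the helper name is hypothetical; the replacement itself is exactly the trick described above):</p>

```python
# Turn a regular Google News search URL into its RSS-feed equivalent by
# swapping "news.google.com/" for "news.google.com/rss/".
def to_rss_url(google_news_url):
    return google_news_url.replace("news.google.com/", "news.google.com/rss/", 1)

search = "https://news.google.com/search?q=barack+obama+site:cnn.com&hl=en-US"
print(to_rss_url(search))
# https://news.google.com/rss/search?q=barack+obama+site:cnn.com&hl=en-US
```

<p>Fetching the rewritten URL returns XML that you can parse like any other RSS feed.</p>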



<p>Even though Google puts a limit on your access elsewhere, they seem to allow their RSS feeds to be accessed freely.&nbsp; In other words, there are no request limits on their feeds, so you can scrape them as much as you want.&nbsp; We can now use Google to do the heavy lifting and find articles for us.&nbsp; All we need to do is scrape the links and code the data analysis.</p>



<p>Unfortunately, <em>Google News RSS</em> feeds will give you a maximum of 100 articles.  Therefore, if you know your search will return over 100 results, you may need to fiddle with the parameters and adjust the number of RSS feeds to scrape.  There is also a good resource <a href="https://blog.newscatcherapi.com/google-news-rss/" target="_blank" rel="noreferrer noopener">here</a> if you want to learn more about Google News RSS feeds.</p>



<h2 class="wp-block-heading">Scraping the Articles with Newspaper3k and Producing Sentiment Scores with TextBlob</h2>



<p>As mentioned above, we are going to scrape articles about President Biden from the following news services: ABC News, CNN, and Fox News.&nbsp; We will collect 100 articles per month from September 2020 to August 2021, do a sentiment analysis of each article, save the data, and then plot monthly data and do some basic analysis.</p>



<p>Before we begin, if this is your first introduction to Newspaper3k and text sentiment analysis, I have written articles <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" target="_blank" rel="noreferrer noopener">here</a> and <a href="https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/" target="_blank" rel="noreferrer noopener">here</a> introducing the subject.  Therefore, I will not go over the scraping part in great detail.  Instead, I will show you the code and briefly review it.  You can type it out in either your <a href="https://blog.finxter.com/python-idle-vs-pycharm/" target="_blank" rel="noreferrer noopener" title="Python IDLE vs PyCharm">preferred code editor</a> or <a href="https://blog.finxter.com/category/jupyter/" target="_blank" rel="noreferrer noopener">Jupyter Notebooks</a>; I used Jupyter for this entire exercise.</p>



<p>First, we import the necessary libraries:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from newspaper import Article, ArticleException
from bs4 import BeautifulSoup
from dateutil.rrule import *
from datetime import *
from textblob import TextBlob
import nltk
import csv

nltk.download('punkt')
</pre>



<p>Now, we need to set our timeframe.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">start_dates = [datetime.strftime(dt, "%Y-%m-%d") for dt in rrule(MONTHLY, dtstart=datetime(2020, 9, 1), until=datetime(2021, 8, 1))]
end_dates = [datetime.strftime(dt, "%Y-%m-%d") for dt in rrule(MONTHLY, dtstart=datetime(2020, 9, 30), bymonthday=(31, -1), bysetpos=1, until=datetime(2021, 8, 31))]
dates_list = list(zip(start_dates, end_dates))</pre>



<p>I’m using the <a href="https://blog.finxter.com/dealing-with-timezone-differences-in-python/" target="_blank" rel="noreferrer noopener" title="How to Handle Timezone Differences in Python"><em>dateutil</em></a> library to set our dates for us.  <code>dateutil.rrule</code> lets us efficiently set monthly intervals and get the correct number of days per month.  Using <a href="https://blog.finxter.com/list-comprehension/" target="_blank" rel="noreferrer noopener" title="List Comprehension in Python — A Helpful Illustrated Guide">list comprehension</a> we build a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to Python Lists">list</a> of start dates and a list of end dates for every month.  Then we use <a href="https://blog.finxter.com/zip-unzip-python/" target="_blank" rel="noreferrer noopener"><code>zip()</code></a> to pair the matching start and end dates into tuples and collect those pairs into one list.  These pairs become the start- and end-date search terms, and they must be formatted as the <code>'%Y-%m-%d'</code> strings shown.</p>
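<p>As a quick sanity check, you can re-run the date snippet above and inspect a few of the generated pairs (the expected values follow directly from the chosen date range):</p>

```python
# Re-create the monthly (start, end) date pairs and verify a few of them.
from dateutil.rrule import rrule, MONTHLY
from datetime import datetime

start_dates = [datetime.strftime(dt, "%Y-%m-%d")
               for dt in rrule(MONTHLY, dtstart=datetime(2020, 9, 1),
                               until=datetime(2021, 8, 1))]
# bymonthday=(31, -1) with bysetpos=1 picks the last day of each month.
end_dates = [datetime.strftime(dt, "%Y-%m-%d")
             for dt in rrule(MONTHLY, dtstart=datetime(2020, 9, 30),
                             bymonthday=(31, -1), bysetpos=1,
                             until=datetime(2021, 8, 31))]
dates_list = list(zip(start_dates, end_dates))

print(dates_list[0])    # ('2020-09-01', '2020-09-30')
print(dates_list[5])    # ('2021-02-01', '2021-02-28') -- February handled correctly
print(len(dates_list))  # 12
```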



<p>I will now show you the function that takes those dates as a parameter, scrapes the articles, and performs the sentiment analysis.  This section will take several hours to run.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_articles(dates):
    news_sites = ['foxnews.com', 'cnn.com', 'abcnews.go.com']
    data = []

    for site in news_sites:
        
        # loop through each set of dates you wish to input
        for date1, date2 in dates:
            articles = []
            
            # URL of the Google News RSS with Joe Biden in the search parameter
            main_url = ("https://news.google.com/rss/search?q=joe+biden+after:" + date1 +
                        "+before:" + date2 + "+site:" + site +
                        "&amp;ceid=US:en&amp;hl=en-US&amp;gl=US")
            
            # get the results in the RSS and collect the links
            response = requests.get(main_url)
            webpage = response.content
            soup = BeautifulSoup(webpage, features="xml")
            items = soup.find_all('item')
            for item in items:
                link = item.find('link').text
                articles.append(link)

            # parse the articles and get the polarity and subjectivity scores for each in your 
            # specified time frame
            for url in articles:
                # print(url)
                # We throw this in a try/except block in case we get a bad link which would kill the program
                try:
                    # Newspaper3k to scrape the article text and title
                    article = Article(url)
                    article.download()
                    article.parse()
                    article.nlp()

                    title = article.title
                    text = article.text
                                        
                    # to make sure that the article does include Joe Biden
                    # depending on your criteria, Google will often give you completely irrelevant results
                    if 'Biden' in text:

                        # run sentiment analysis on the article text
                        # create a Textblob object and then get the sentiment values and store them
                        text_blob = TextBlob(text)
                        polarity = text_blob.polarity
                        subjectivity = text_blob.subjectivity

                        # in case we get a non-article as a link - do not include in data 
                        if polarity == 0 and subjectivity == 0:
                            pass
                        else:
                            # Save the necessary data to then put in a csv file
                            save = [site, url, title, datetime.strptime(date2, '%Y-%m-%d').strftime('%Y-%m'), polarity, subjectivity] 
                            data.append(save)
                
                # If there is a bad link, move on to the next one
                except ArticleException:
                    pass

    return data

# collect the article data with our specified dates
articles = get_articles(dates_list)

# write data to csv
with open('combined.csv', 'w', newline='') as csv_file:
    header = ['News Source', 'URL', 'Title', 'Month', 'Polarity', 'Subjectivity']
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(articles)
</pre>



<p>After passing the dates into <code>get_articles()</code>, we fetch one RSS feed per month by inserting each start date, end date, and news-source domain into the Google RSS URL.  Then we store the article links from each feed in a list, scrape and run sentiment analysis on each article, and save the results to a <a href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank" rel="noreferrer noopener" title="How to Read a CSV File Into a Python List?">CSV</a> file for later use.</p>



<p>It is necessary to put the scraping code in a try/except block because a single bad article link in your results would otherwise kill the program.  Newspaper3k raises an ‘<code>ArticleException</code>’ in that case, so use it in your ‘<a href="https://blog.finxter.com/how-to-catch-and-print-exception-messages-in-python/" target="_blank" rel="noreferrer noopener" title="How to Catch and Print Exception Messages in Python">except</a>’ block and make sure to include it in the imports at the top.</p>



<p>Now that we have our data stored and ready to be analyzed, we can go ahead and plot the data.</p>



<h2 class="wp-block-heading">Plotting The Data</h2>



<p>I am going to show some simple plots you can make with <a href="https://blog.finxter.com/how-to-change-the-figure-size-for-a-seaborn-plot/" target="_blank" rel="noreferrer noopener">Seaborn</a>, an excellent tool for making nice-looking graphs and plots.  The documentation is <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">here</a>; there are also several good articles and videos on <a href="https://blog.finxter.com/" target="_blank" rel="noreferrer noopener">finxter.com</a> if you want to see some more advanced implementations, like <a href="https://blog.finxter.com/heatmaps-with-seaborn/" target="_blank" rel="noreferrer noopener">this article making heatmaps of COVID data</a>. </p>



<p>First, we need to put the data into a <a href="https://blog.finxter.com/pandas-quickstart/" target="_blank" rel="noreferrer noopener">pandas</a> DataFrame, normalize the month column (parsing each value with <code>to_datetime</code> and formatting it back to a <code>"%Y-%m"</code> string so the months sort chronologically), and then sort the data by both month and news source (although sorting by news source is not strictly necessary).  I use ‘<code>inplace=True</code>’ so that the DataFrame stays sorted throughout the entire notebook kernel.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># set the dataframe, convert the date column from a string, and then sort so the data is plotted correctly
import pandas as pd
import seaborn as sns

df = pd.read_csv('combined.csv')
df['Month'] = pd.to_datetime(df['Month']).dt.strftime("%Y-%m")
df.sort_values(by=['Month', 'News Source'], inplace=True)
</pre>



<p>Now we can <a href="https://blog.finxter.com/matplotlib-full-guide/" target="_blank" rel="noreferrer noopener" title="Matplotlib — A Simple Guide with Videos">plot </a>our data.</p>



<p>Since the reason we plot graphs is to answer certain questions that we have about some data, let’s ask a few questions ourselves and see what the results give us.&nbsp; First: are ABC, CNN, and Fox News in general favorably inclined, neutral, or maybe negatively inclined towards the President?&nbsp; Have they been consistent in their reporting this past year?&nbsp; If not, did something happen that might have led to opinion changing in either direction?</p>



<p>Let’s plot a simple <a href="https://blog.finxter.com/matplotlib-line-plot/" target="_blank" rel="noreferrer noopener" title="Matplotlib Line Plot – A Helpful Illustrated Guide">line graph</a> over time of the average polarity per month for ABC, CNN, and Fox News.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># line chart of the average polarity per month for each source
# sns.set(rc={'figure.figsize':(15,10)})
sns.lineplot(x = 'Month', y = 'Polarity', hue='News Source', ci=None, data = df)
</pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="751" height="479" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-25.png" alt="" class="wp-image-35057" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-25.png 751w, https://blog.finxter.com/wp-content/uploads/2021/09/image-25-300x191.png 300w" sizes="(max-width: 751px) 100vw, 751px" /></figure></div>



<p>We can also use <a href="https://blog.finxter.com/display-modify-and-save-images-with-matplotlib/" target="_blank" rel="noreferrer noopener" title="How to Display, Modify and Save Images in Matplotlib">bar graphs</a> if we want to visualize the data differently.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># bar chart of the average polarity per month for each source
sns.barplot(y = 'Polarity', x = 'Month', hue='News Source', ci=None, data = df)
</pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img decoding="async" width="729" height="445" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-26.png" alt="" class="wp-image-35058" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-26.png 729w, https://blog.finxter.com/wp-content/uploads/2021/09/image-26-300x183.png 300w" sizes="(max-width: 729px) 100vw, 729px" /></figure></div>



<p>By setting a hue, we allow the news sources to be grouped out and analyzed separately.  We set ‘<code>ci=None</code>’ so that the <a href="https://blog.finxter.com/how-to-plot-the-confidence-interval-in-python/" target="_blank" rel="noreferrer noopener" title="How to Plot the Confidence Interval in Python?">confidence interval </a>is not shown, otherwise the graphs would look messy.</p>



<p>Polarity measures how positively or negatively inclined a text is.  With a range from -1 (extremely negative) to 1 (extremely positive), the polarity can potentially tell us whether or not a text is favorable towards its topic.  In practice, the range of our data is not very broad: most news articles fall between -0.2 and 0.3, and even outlets with a stronger political bent generally stay within this range.  There is a good article <a href="https://planspace.org/20150607-textblob_sentiment/" target="_blank" rel="noreferrer noopener">here</a> if you want to learn more about how <code>TextBlob</code> produces its sentiment scores.</p>



<p>With only these graphs, our questions are mostly answered.&nbsp; If you are an American, you would expect Fox News to have a less favorable opinion of a Democrat President and you would expect the opposite for CNN.&nbsp; ABC, in general, has managed to stay somewhat in the middle.&nbsp;</p>



<p>From September to November of last year the United States had their elections, so it would be natural for CNN, and even ABC, to throw more of their weight behind a candidate they would much rather have than ex-President Trump.&nbsp; On the other hand, in August of this year, the USA had its disastrous troop pullout from Afghanistan which many Americans saw as an embarrassment to the country.&nbsp; This is certainly reflected in the low polarity score for the month from Fox News.&nbsp; Even the other news outlets did not seem to like what happened.</p>



<p>If you are curious about the distribution of polarity values for each news outlet, we can make a quick histogram for August.  We isolate the August data with ‘<code>get_group()</code>’ after grouping the entire <code>DataFrame</code> by month, then plot with ‘<code>sns.histplot</code>’, setting <code>x</code> to ‘<code>Polarity</code>’, the hue to ‘<code>News Source</code>’ so each outlet gets its own distribution, and the <em>kde</em> (kernel density estimate) to ‘<code>True</code>’ to smooth the distributions into drawn lines.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">aug = df.groupby(['Month']).get_group('2021-08')
# months.head()
sns.histplot(data=aug, hue='News Source', x='Polarity', kde=True)
</pre>



<p>Here is our data:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="738" height="483" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-27.png" alt="" class="wp-image-35063" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-27.png 738w, https://blog.finxter.com/wp-content/uploads/2021/09/image-27-300x196.png 300w" sizes="auto, (max-width: 738px) 100vw, 738px" /></figure></div>



<p>Since we selected a fairly balanced sampling of news sources, we can also draw an interesting conclusion: the first year of Biden’s presidency was not particularly successful, given the basic downward polarity trend among all our sources.</p>



<p>On the other hand, the subjectivity does not give away too much other than that maybe the charged atmosphere of the elections calmed down the reporting a little bit.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># line chart of the average subjectivity per month for each source
sns.lineplot(x = 'Month', y = 'Subjectivity', hue='News Source', ci=None, data = df)
</pre>



<p>Subjectivity scores of news articles usually hover between 0.2 and 0.6, so there is nothing out of the ordinary here, although you will often see much higher scores from more politically extreme news sites.</p>



<p>Let’s now ask one more question before I end the article.  ABC has a reputation for being a very balanced news source, leaning just a little to the left politically.  In what was a very contentious political climate in the USA this past year, did the integrity of their reporting on Joe Biden live up to that reputation?</p>



<p><a href="https://blog.finxter.com/matplotlib-boxplot/" target="_blank" rel="noreferrer noopener" title="Matplotlib Boxplot – A Helpful Illustrated Guide">Box plots</a> might be our best choice here because the line plots above could not neatly show the distribution of article scores; we only saw the average.  All we need to do is group the DataFrame by outlet and then use <code>get_group()</code> to isolate ABC’s data.  To make the chart more colorful we can add a color palette; in this case, I use ‘<code>mako</code>&#8217;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">abc = df.groupby(['News Source']).get_group('abcnews.go.com')
sns.boxplot(x = 'Month', y = 'Polarity', data = abc, palette='mako')
sns.boxplot(x = 'Month', y = 'Subjectivity', data = abc, palette='mako')
</pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="742" height="456" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-28.png" alt="" class="wp-image-35065" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-28.png 742w, https://blog.finxter.com/wp-content/uploads/2021/09/image-28-300x184.png 300w" sizes="auto, (max-width: 742px) 100vw, 742px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="721" height="413" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-29.png" alt="" class="wp-image-35066" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-29.png 721w, https://blog.finxter.com/wp-content/uploads/2021/09/image-29-300x172.png 300w" sizes="auto, (max-width: 721px) 100vw, 721px" /></figure></div>



<p>It seems we can say the data confirms that reputation.  Even the events in August did not have much of an effect on ABC&#8217;s reporting on Joe Biden.  If we wish, we can compare the data to Fox News on a single chart by concatenating the two outlets&#8217; DataFrames and plotting the result.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># compare ABC and Fox polarity data side-by-side
fox = df.groupby(['News Source']).get_group('foxnews.com')
abc = df.groupby(['News Source']).get_group('abcnews.go.com')
a_f = pd.concat([abc, fox])
sns.boxplot(x = 'Month', y = 'Polarity', hue='News Source', data = a_f)
</pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="738" height="503" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-30.png" alt="" class="wp-image-35067" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-30.png 738w, https://blog.finxter.com/wp-content/uploads/2021/09/image-30-300x204.png 300w" sizes="auto, (max-width: 738px) 100vw, 738px" /></figure></div>



<p>Since the Fox News data is more varied, we can say that, compared to Fox News, ABC is more balanced, at least in its reporting on the current President.</p>



<p>Given that arguably the 3 most popular news outlets in the USA are increasingly negative on the topic of Joe Biden, we might conclude that the President needs to do something to improve his popularity nationwide.&nbsp; However, a more in-depth analysis would be required.</p>



<p>This concludes our series on news article scraping and analysis with the help of <a href="https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/" target="_blank" rel="noreferrer noopener">Newspaper3k</a>, a tool that allows you to scrape massive amounts of news article data with only a couple of lines of code. </p>



<p>I have also included more graphs and code in my video above or on my <a href="https://github.com/finxter/Articles" target="_blank" rel="noreferrer noopener" title="https://github.com/finxter/Articles">Github.</a></p>
<p>The post <a href="https://blog.finxter.com/analyzing-news-articles-with-newspaper3k-textblob-and-seaborn/">Analyzing News Articles with Newspaper3k, TextBlob, and Seaborn</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Newspaper3k &#8211; How to Generate a Word Cloud in Python</title>
		<link>https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/</link>
		
		<dc:creator><![CDATA[Craig Helstowski]]></dc:creator>
		<pubDate>Sat, 28 Aug 2021 10:14:29 +0000</pubDate>
				<category><![CDATA[Newspaper3k]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=34485</guid>

					<description><![CDATA[<p>To carry on from our introduction to Newspaper3k, we can now take our basic knowledge and realize the possibilities of what we can do with this library. Here I’m going to demonstrate for you a project which takes articles from a set of different news agencies, picks out the most used words from them, and ... <a title="Newspaper3k &#8211; How to Generate a Word Cloud in Python" class="read-more" href="https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/" aria-label="Read more about Newspaper3k &#8211; How to Generate a Word Cloud in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/">Newspaper3k &#8211; How to Generate a Word Cloud in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Newspaper3k – How to Generate a Word Cloud in Python" width="937" height="527" src="https://www.youtube.com/embed/aGgnM-Eu3Wg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>To carry on from our introduction to <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" target="_blank" rel="noreferrer noopener" title="Newspaper3k – A Python Library For Fast Web Scraping">Newspaper3k</a>, we can now take our basic knowledge and realize the possibilities of what we can do with this library.</p>



<p>Here I’m going to demonstrate for you a project which takes articles from a set of different news agencies, picks out the most used words from them, and shows a word cloud of the results with the help of <a href="https://blog.finxter.com/6-best-python-nlp-libraries/" target="_blank" rel="noreferrer noopener" title="6 Best Python NLP Libraries">NLP </a>and <a href="https://blog.finxter.com/matplotlib-full-guide/" target="_blank" rel="noreferrer noopener" title="Matplotlib — A Simple Guide with Videos">Matplotlib</a>.</p>



<p>You can check out the full code on <a href="https://github.com/finxter/Articles" title="https://github.com/finxter/Articles" target="_blank" rel="noreferrer noopener">GitHub here</a>. </p>



<p>Let’s get started.</p>



<h2 class="wp-block-heading">Introduction</h2>



<p>In this article, we are going to scrape a series of articles from several different news sources using <a href="https://newspaper.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener" title="https://newspaper.readthedocs.io/en/latest/">Newspaper3k</a>. Once we have extracted the keywords from each article, we will create a word cloud that displays the most important topics of the day.</p>



<p><a href="https://en.wikipedia.org/wiki/Tag_cloud" target="_blank" rel="noreferrer noopener" title="https://en.wikipedia.org/wiki/Tag_cloud"><strong>Word clouds</strong></a> may not be the most penetrating way to analyze text data, but they are an engaging and simple means of discovering which words or word patterns appear most frequently. For example, if you can get the text of a public figure&#8217;s speeches or writings, you can easily visualize the most important topics they cover. To take it further, companies could combine this with sentiment analysis to find out which of their products are written about the most and how positively or negatively they are viewed.</p>



<p>For example, here is a word cloud from <em>‘Laudato Si’</em>, a Vatican encyclical published six years ago. The document is about 250 pages, yet we can very quickly get the gist of what the encyclical is about by looking at the 100 most-used words in the paper:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="530" height="271" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-34.png" alt="" class="wp-image-34487" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-34.png 530w, https://blog.finxter.com/wp-content/uploads/2021/08/image-34-300x153.png 300w" sizes="auto, (max-width: 530px) 100vw, 530px" /></figure></div>



<p>Depending on the text we are analyzing, we may even be able to determine the basic theme or arguments of a paper just from looking at its word cloud. From this one, we can guess that the encyclical concerns the planet and humanity: there is some sort of problem, and something must be done to help the planet, perhaps for the good of ‘us’ or humanity, perhaps for God as well. As we see, a word cloud built with the help of Newspaper3k and a little data analysis can give us a lot of information about a text in a single picture.</p>



<p>Now that we see the possibilities here, let’s begin to make our own word clouds.</p>



<h2 class="wp-block-heading">Scrape a Set of Articles From Different News Sources</h2>



<p>Given all of the news recently about the American troop withdrawal from Afghanistan, we will focus on news about the United States for this project.&nbsp; We will collect the RSS feeds from the following news sources: ABC News, NBC News, CBS News, RT News, The Guardian, and the New York Times.&nbsp; The link to the feeds will be in the code so you do not have to search for them.</p>



<p>As mentioned in <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" target="_blank" rel="noreferrer noopener">our previous article</a>, RSS feeds allow us to scrape article links quickly and with great ease, especially for today’s news. If this is your first introduction to web scraping or Newspaper3k, I encourage you to read it so you understand how RSS feeds work, how to <a href="https://blog.finxter.com/python-freelancing-fiverr-gig-webscraping/" target="_blank" rel="noreferrer noopener" title="https://blog.finxter.com/python-freelancing-fiverr-gig-webscraping/">scrape</a> them, and which Python libraries you may need to install before you begin. The <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" target="_blank" rel="noreferrer noopener" title="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/">video</a> also shows you how to set up a <a href="https://blog.finxter.com/python-virtual-environments-conda/" target="_blank" rel="noreferrer noopener" title="Python Virtual Environments with Conda — Why the Buzz?">virtual environment</a> in the folder from which you will run your code.</p>



<p>Let’s begin.</p>



<p>First we will <a href="https://blog.finxter.com/how-to-import-libraries-in-pythons-exec-function/" target="_blank" rel="noreferrer noopener" title="How to Import Libraries in Python’s exec() Function?">import </a>some necessary libraries and collect all of our feeds and put them into a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to Python Lists">list</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from bs4 import BeautifulSoup
from newspaper import Article
import csv

feeds = [
         'https://www.nbcnews.com/rss', 
         'https://www.theguardian.com/us/rss', 
         'https://www.rt.com/rss/usa/', 
         'https://abcnews.go.com/abcnews/usheadlines', 
         'https://www.cbsnews.com/latest/rss/us', 
         'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
]
</pre>



<p>Now let’s scrape for the articles. Luckily for us, all of the RSS feeds here can be scraped in exactly the same way.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">articles = []

for feed in feeds:
    response = requests.get(feed)
    webpage = response.content
    soup = BeautifulSoup(webpage, features='xml')

    # every article link will be found in an item tag
    items = soup.find_all('item')

    # extract the link
    for item in items:
        link = item.find('link').text
        articles.append(link)
</pre>



<p>In an RSS feed, every article link is included in its own separate <code>&lt;item&gt;</code> tag. We can simply look for every instance of the <code>&lt;item&gt;</code> tag and collect the link from the <code>&lt;link&gt;</code> tag inside.</p>



<p>Now that we have our list of articles, it’s time to scrape each article using the Newspaper3k library.&nbsp; Then we will store the data in a <a href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank" rel="noreferrer noopener" title="How to Read a CSV File Into a Python List?">CSV </a>file.&nbsp; For this article, we will save the URL, the article keywords, and the text (in case we would like to do further analysis of the text).&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">data = []

for url in articles:
    info = Article(url)
    info.download()
    info.parse()
    info.nlp()

    keywords = info.keywords
    text = info.text

    # save the URL, the keywords, and the text
    save = [url, keywords, text]
    data.append(save)


with open('MyCSV.csv', 'w', newline='') as csv_file:
    # set the column labels for the CSV file
    label = ['URL', 'Keywords', 'Text']

    # write the data into the CSV file
    writer = csv.writer(csv_file)
    writer.writerow(label)
    writer.writerows(data)
</pre>



<h2 class="wp-block-heading">Display a Word Cloud From the Data</h2>



<p>Now that we have all of our data stored, we can let the fun begin and create our word clouds.</p>



<p>Before we develop our big word cloud from all of the articles, I will show you how to quickly create a word cloud from just one article.</p>



<p>First, if you do not have Jupyter Notebook on your computer, either of the following commands will install it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">conda install -c conda-forge notebook</pre>



<p>or</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install notebook</pre>



<p>Then the following command will open up Jupyter notebooks:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">jupyter notebook</pre>



<p>Jupyter will then open up in your browser and you can begin work.</p>



<h3 class="wp-block-heading">Creating Simple Word Clouds</h3>



<p>Let’s first install the necessary libraries. We will need <a href="https://blog.finxter.com/pandas-quickstart/" target="_blank" rel="noreferrer noopener" title="10 Minutes to Pandas (in 5 Minutes)">pandas</a> to read the CSV data, the <code>wordcloud</code> library to create our word cloud image, and <a href="https://blog.finxter.com/best-matplotlib-cheat-sheet/" target="_blank" rel="noreferrer noopener" title="Best Matplotlib Cheat Sheet">Matplotlib</a> to display the word cloud. If you need to install the three libraries, this will suffice:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install pandas
pip install matplotlib
pip install wordcloud</pre>



<p>Now the code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt
</pre>



<p>Now we need to load the data from the CSV file we just saved and create a data frame. The first row only holds the titles for each column, so pandas will treat it as the header rather than as data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Load in the dataframe
# converters={'Keywords': eval} → to convert each row of keywords back into a list
df = pd.read_csv('MyCSV.csv', converters={'Keywords': eval})

# if you want to see how many rows of data you have
print(df.shape[0])

# if you wish to display the first rows of data
df.head()</pre>



<p>Here is what the first couple of rows of my data look like:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="664" height="137" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-35.png" alt="" class="wp-image-34489" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-35.png 664w, https://blog.finxter.com/wp-content/uploads/2021/08/image-35-300x62.png 300w" sizes="auto, (max-width: 664px) 100vw, 664px" /></figure></div>



<p>Setting the converter for the ‘Keywords’ column is necessary to turn each row of keywords back into the list we originally saved into the CSV file. Without it, each row of keywords would simply be one big string that is a pain to deal with.</p>
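<p>As a minimal sketch of what that converter does (the stored string below is made up), here is the round trip from a CSV cell back to a list. The safer standard-library <code>ast.literal_eval</code> is shown as a drop-in for <code>eval</code>, since it only parses Python literals:</p>

```python
import ast

# A CSV cell holding a keyword list is stored as the list's string representation
stored_cell = "['afghanistan', 'covid', 'biden']"

# ast.literal_eval parses the literal back into a real Python list
keywords = ast.literal_eval(stored_cell)

print(type(keywords).__name__)  # list
print(keywords[0])              # afghanistan
```

<p>Passing <code>converters={'Keywords': ast.literal_eval}</code> to <code>pd.read_csv</code> works the same way, without the risks of <code>eval</code>.</p>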



<p>Just so you know how, we will first make a word cloud from a single article. Since it is not very useful to build a word cloud from the handful of keywords of a single article, we will develop it from the article’s text instead.</p>



<p>We use the set of ‘stopwords’ included in the <code>wordcloud</code> library to remove filler words like ‘the’, ‘we’, etc. that would add clutter to the word cloud. If you wish to add more stopwords, you can call <code>stopwords.add("your word")</code>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Start with one article:
text = df.Text[177]
print(df.URL[177])
# → https://www.rt.com/usa/532248-lake-mead-water-shortage/?utm_source=rss&utm_medium=rss&utm_campaign=RSS

# stopwords is simply a set of words to be eliminated
# if not set manually, the default STOPWORDS list is used
stopwords = set(STOPWORDS)

# if you want to add to the stopwords list; here I add some news source names
stopwords.update(['s', 't', 'rt', 'co', 'abc', 'nbc', 'cbs', 'nytimes'])

# Create and generate a basic word cloud image:
wordcloud = WordCloud(stopwords=stopwords, max_words=50, background_color="white").generate(text)

# Display the generated image:
plt.figure(figsize=[10, 10])
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
</pre>



<p>Here is what my word cloud looks like from this article:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="554" height="283" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-36.png" alt="" class="wp-image-34490" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-36.png 554w, https://blog.finxter.com/wp-content/uploads/2021/08/image-36-300x153.png 300w" sizes="auto, (max-width: 554px) 100vw, 554px" /></figure></div>



<p>As we can see, this article is about some water shortage in either Nevada or Arizona.</p>



<p>Now that we know how to create a basic word cloud from a single article, let’s now expand that to find the most important key topics of the day from a large set of articles.</p>



<p>It will not be necessary to run our analysis from the text of each article (although you can if you wish), so we will just create a word cloud from all of the keywords extracted.</p>



<p>First we need to gather all of the keywords into a single list and then join them into one big text. With a little ‘Python-fu’ it can easily be done:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># for each row i in df.Keywords
#     for each keyword j in i
#         append j
keywords = [j for i in df.Keywords for j in i]

# join all of the keywords into a single string text
text = " ".join(i for i in keywords)

# print(text)
</pre>



<p>Now we have a full text of all of our keywords ready to be made into a word cloud.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># collocations=False so that duplicate words do not appear as part of a larger phrase (like 'president Biden')
wordcloud = WordCloud(stopwords=stopwords,max_words=100,background_color="white",collocations=False).generate(text)

# Display the generated image:
plt.figure(figsize=[10,10])
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
</pre>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="606" height="310" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-37.png" alt="" class="wp-image-34491" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-37.png 606w, https://blog.finxter.com/wp-content/uploads/2021/08/image-37-300x153.png 300w" sizes="auto, (max-width: 606px) 100vw, 606px" /></figure></div>



<p>As we can see, the most-used keywords correspond to the most important topics in the United States today: Afghanistan and COVID-19. Events in Afghanistan especially are dominating the headlines.</p>



<h3 class="wp-block-heading">Getting a Little Creative With Our Word Cloud</h3>



<p>For kicks, I will show you how you can have a little fun creating your word cloud. It does not have to look like the one above.</p>



<p>Since we are dealing with news from the United States, why don’t we create a word cloud in the shape and color of the American flag?</p>



<p>First, find a picture of the flag and then save it as a PNG file.&nbsp; Here is the link to the image I used:</p>



<ul class="wp-block-list"><li><a href="https://cdn.britannica.com/33/4833-004-828A9A84/Flag-United-States-of-America.jpg" target="_blank" rel="noreferrer noopener">https://cdn.britannica.com/33/4833-004-828A9A84/Flag-United-States-of-America.jpg</a></li></ul>



<p>Now we will load it into our program and create a word cloud that looks like the American flag.</p>



<p>You will need to <code>pip install numpy</code> and <a href="https://blog.finxter.com/python-install-pil/" target="_blank" rel="noreferrer noopener" title="How to Install PIL/Pillow in Python? A Helpful Illustrated Guide">Pillow</a> if you do not have them.</p>



<p>Now our code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
from PIL import Image
from wordcloud import ImageColorGenerator

# Generate a word cloud image
frame = np.array(Image.open("usa.png"))

wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=500, collocations=False, mask=frame).generate(text)

# create coloring from image
image_colors = ImageColorGenerator(frame)

plt.figure(figsize=[15,15])
#add the color scheme to the word cloud
plt.imshow(wordcloud.recolor(color_func=image_colors))
plt.axis("off")
plt.show()
</pre>



<p>We take the image and create a mask, or frame, which directs where the words can go and what colors they will be, based on an array of numerical values mapped from the image. That mask is then passed into the <code>WordCloud</code> object. Finally, <code>ImageColorGenerator</code> extracts from the image the colors that will go into our new word cloud.</p>



<p>In this case I added a couple hundred extra words to fill in the flag a little more, so you can see how the program fits the words to the shape and contours of the flag.</p>



<p>Here is our new ‘American’ word cloud now:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="664" height="353" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-38.png" alt="" class="wp-image-34492" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-38.png 664w, https://blog.finxter.com/wp-content/uploads/2021/08/image-38-300x159.png 300w" sizes="auto, (max-width: 664px) 100vw, 664px" /></figure></div>



<p>Enjoy!</p>



<p>If you want a link to the code and the CSV I used, I have included it on <a href="https://github.com/Boniface8/Articles/blob/main/WordCloud.ipynb" target="_blank" rel="noreferrer noopener">my GitHub</a>.</p>
<p>The post <a href="https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/">Newspaper3k &#8211; How to Generate a Word Cloud in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Newspaper3k – A Python Library For Fast Web Scraping</title>
		<link>https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/</link>
		
		<dc:creator><![CDATA[Craig Helstowski]]></dc:creator>
		<pubDate>Wed, 18 Aug 2021 14:06:55 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Newspaper3k]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=34047</guid>

					<description><![CDATA[<p>Would you like to be able to scrape information from any article without having to write a completely different set of code every time? In this post, I will show you a Python library which allows you to scrape any article using only a few lines of code.&#160; It’s called Newspaper3k. ? Video:&#160;As you go ... <a title="Newspaper3k – A Python Library For Fast Web Scraping" class="read-more" href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" aria-label="Read more about Newspaper3k – A Python Library For Fast Web Scraping">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/">Newspaper3k – A Python Library For Fast Web Scraping</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Would you like to be able to scrape information from any article without having to write a completely different set of code every time?</p>



<p>In this post, I will show you a Python library which allows you to scrape any article using only a few lines of code.&nbsp; It’s called <strong><em>Newspaper3k</em></strong>.</p>



<p class="has-black-color has-cyan-bluish-gray-background-color has-text-color has-background"><strong>Video</strong>:&nbsp;As you go through the article, you can also watch my explainer video presenting the code snippet introduced here&#8212;in an easy-to-follow, step-by-step manner:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Newspaper3k – A Python Library For Fast Web Scraping" width="937" height="527" src="https://www.youtube.com/embed/g4XyKd3VgoA?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Why?</h2>



<p>Let’s start by asking why <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank" rel="noreferrer noopener" title="Web Scraping With BeautifulSoup In Python">scraping </a>news or blog articles ‘en masse’ is so useful.&nbsp; Some of the reasons include the following:</p>



<ul class="wp-block-list"><li>A business wants to discover <strong><em>trends</em></strong> or search what people are saying about their company in order to make more informed decisions.</li><li>An individual or service wants to collect and <strong><em>aggregate news</em></strong>.</li><li>For a <strong><em>research project</em></strong>, such as discovering which news is real and which is fake, researchers may need to collect a large set of articles.</li><li>A journalist could look to gather articles that back the <strong><em>claims</em></strong> or arguments made in articles they wrote.</li></ul>



<p>In today’s world, there is an overwhelming amount of news available on the internet. Therefore, if you have to scrape articles it is important to know what information to look for, where to find it, and extract the information you need without wasting time. You do not need to be a professional to understand this. We all deal with content from various sources in our daily lives and we can intuit very well what we need, what we don’t need, and what attracts our curiosity enough that we may want to explore further. How much time we would waste if we did not know how to sift through all of this information!</p>



<p>However, if you must program a web scraper, it can be a drag to have to inspect the HTML or CSS every time and write a new set of code for every site you need to scrape. The task is made even more difficult if the content is dynamically loaded. Wouldn’t it be much easier if you could scrape all the information you need from any article using the same couple of lines of code?</p>



<p>It is here where the power of Python shines again. With the Newspaper3k library, you can <strong><em>extract article data for almost any news service or blog with only the same few lines of code.</em></strong></p>



<h2 class="wp-block-heading">What is Newspaper3k?</h2>



<p>Newspaper3k is a Python library for scraping web articles. It utilizes the requests library, has <a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/" target="_blank" rel="noreferrer noopener" title="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/">BeautifulSoup</a> as a dependency, and uses <code>lxml</code> for parsing. Newspaper3k can not only scrape the entire article text for you, but also other kinds of data such as the publish date, author(s), URL, images, and videos, to name a few. If you wish to know what an article is about without reading the whole thing, Newspaper3k can also produce a summary of it.</p>



<p>After you extract the data, it can be integrated and saved into different formats such as <a href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank" rel="noreferrer noopener" title="How to Read a CSV File Into a Python List?">CSV</a>, <a href="https://blog.finxter.com/how-to-get-json-from-url-in-python/" target="_blank" rel="noreferrer noopener" title="How to Get JSON from URL in Python?">JSON</a>, and even a <a href="https://blog.finxter.com/pandas-quickstart/" target="_blank" rel="noreferrer noopener" title="10 Minutes to Pandas (in 5 Minutes)">pandas</a> data frame. Newspaper3k also works in over 30 languages.</p>
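<p>As a quick illustration of saving to another format (the field values here are made up, standing in for data scraped with Newspaper3k), the standard library alone can round-trip the extracted fields through JSON:</p>

```python
import json

# Hypothetical values standing in for fields scraped from an article
article_data = {
    'title': 'Example Headline',
    'authors': ['Jane Doe'],
    'keywords': ['python', 'scraping'],
}

# Serialize to a JSON string (json.dump would write straight to a file)
as_json = json.dumps(article_data)

# ...and load it back into a dictionary
restored = json.loads(as_json)
print(restored['title'])  # Example Headline
```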



<p>The Newspaper3k Python library can also do more advanced functions such as discovering RSS feeds, scraping for article URLs from a main news source, and even multi-thread extraction if you have to scrape for more than one article but cannot afford to bombard a website with so many requests.</p>
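<p>Newspaper3k ships its own <code>news_pool</code> helper for threaded downloads; to sketch the underlying idea with only the standard library (the <code>fetch</code> function here is a hypothetical stand-in for downloading and parsing one article):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for Article(url).download() / .parse()
    return f'parsed content of {url}'

urls = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/c',
]

# Download a few articles at a time instead of all at once
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, urls))

print(len(results))  # 3
```

<p>Capping <code>max_workers</code> keeps you from hammering a site with dozens of simultaneous requests.</p>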



<p>I will now show you two sample demonstrations using Newspaper3k. The first is a very simple article scraper. In the second, I will show you how Newspaper3k enables speedy sentiment analysis of news articles.</p>



<h2 class="wp-block-heading">A Simple Article Scraper Using Newspaper3k</h2>



<p>Here I will show you how you can scrape a single article in only a couple lines of code.</p>



<p>To use Newspaper3k, we must first install the package:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip3 install newspaper3k</pre>



<p>Now let’s write the code.&nbsp; We will choose this article as our example to scrape:</p>



<p><a href="https://www.cnbc.com/2021/02/02/5-freelance-jobs-where-you-can-earn-100000-or-more-during-pandemic.html" target="_blank" rel="noreferrer noopener">5 freelance jobs where you can earn $100,000 or more during the pandemic</a></p>



<p>Let’s first extract the information and then store the data from the parsed article object into their appropriate variables:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from newspaper import Article

# create an article object
article = Article('https://www.cnbc.com/2021/02/02/5-freelance-jobs-where-you-can-earn-100000-or-more-during-pandemic.html')
article.download()
article.parse()
article.nlp()

title = article.title
link = article.url
authors = article.authors
date = article.publish_date
image = article.top_image
summary = article.summary
text = article.text
</pre>



<p>We first need to import the Article object from the Newspaper3k library and then we can extract the information.&nbsp; Following the order shown is necessary.&nbsp; We must also include the <code>nlp()</code> function in order for us to process the keywords from the article using <em>Natural Language Processing</em> (NLP) and to also summarize the article.</p>



<p>Now that we have the information stored, we can <a href="https://blog.finxter.com/python-print/" target="_blank" rel="noreferrer noopener" title="Python print()">print </a>out our data:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print('**********************************')
print(f'Title: {title}')
print(f'Link: {link}')
print(f'Author: {authors[0]}')
print(f'Publish Date: {date}')
print(f'Top Image: {image}')
print('Summary:')
print(summary)
print('**********************************')
</pre>



<p>And the output:</p>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-25.png" alt="" class="wp-image-34049" width="664" height="112" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-25.png 664w, https://blog.finxter.com/wp-content/uploads/2021/08/image-25-300x51.png 300w" sizes="auto, (max-width: 664px) 100vw, 664px" /></figure></div>



<p>Not too bad for only a couple of lines, don’t you think?</p>



<h2 class="wp-block-heading">An Article Sentiment Analysis Program With Newspaper3k</h2>



<p>Now I will show you a more expanded demonstration in which we will collect articles from a news source and then print out a summary of each article with its corresponding link and sentiment scores.&nbsp; The sentiment scores will display the polarity and subjectivity scores for each article.</p>



<p>Let’s say we are doing a sentiment analysis of articles from a particular website.&nbsp; In this case, we will select <a href="https://abcnews.go.com/Technology/" target="_blank" rel="noreferrer noopener">ABC Technology News</a>.&nbsp; We first need to find a way to gather a collection of articles from the news site for us to scrape.</p>



<p>A very easy way to collect article links from a news source is to get its RSS feed if it is available.&nbsp;</p>



<h3 class="wp-block-heading">What Is an RSS Feed, and Why Is It Useful to Scrape?</h3>



<p>RSS stands for &#8220;Really Simple Syndication.&#8221;&nbsp; RSS feeds allow the content of a website to be shared and distributed to other services much more easily.&nbsp; Users can stream content from any news source into their content aggregator of choice (such as Flipboard), while news sources can use RSS to broaden the reach of their content delivery and potentially attract more readers.&nbsp; RSS feeds are often included in email content delivery services as well.</p>



<p>RSS feeds are incredibly useful for web scraping for two reasons.&nbsp; First, the article links are organized and formatted in a way that makes them much easier to find and extract than on a regular web page.&nbsp; Second, almost all RSS feeds follow the same standard format, so the same code can often be reused to extract article links from more than one feed.&nbsp;</p>
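<p>To see what this standard format means in practice, here is a minimal sketch that pulls article links out of a tiny, hypothetical RSS document (the feed content and URLs below are made up for illustration).&nbsp; It uses the standard library&#8217;s <code>xml.etree.ElementTree</code> so you can try it without installing anything:</p>

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RSS feed illustrating the standard structure:
# every article sits in an <item> tag, with its URL in a <link> tag.
rss = """<rss version="2.0"><channel>
<item><title>Story A</title><link>https://example.com/a</link></item>
<item><title>Story B</title><link>https://example.com/b</link></item>
</channel></rss>"""

# Parse the XML and collect the text of every <link> inside an <item>.
root = ET.fromstring(rss)
links = [item.findtext("link") for item in root.iter("item")]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```

<p>Because real feeds follow this same shape, the exact same extraction logic works across different news sources.</p>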



<p>That said, scraping RSS feeds is no different from scraping regular websites.&nbsp; Make sure you can legally scrape the data from an RSS feed before going ahead and doing so.&nbsp; Some news sources place limitations on what you can do with RSS data, so before you decide to scrape a feed, visit the news site and check whether it has any RSS policies.&nbsp; Once you believe it is okay to scrape the feed, follow proper scraping practices: respect the Terms and Conditions, and do not bombard the site with too many requests.&nbsp;</p>
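<p>One simple way to avoid bombarding a site is to enforce a delay between requests.&nbsp; Here is a minimal sketch of that idea; the <code>polite_fetch</code> helper and its <code>delay</code> value are my own illustrative names, not part of Newspaper3k or any library:</p>

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request after the first
        results.append(fetch(url))
    return results
```

<p>In a real scraper you might pass <code>requests.get</code> as the <code>fetch</code> argument, e.g. <code>polite_fetch(article_links, requests.get, delay=2.0)</code>.</p>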



<h3 class="wp-block-heading">Coding the Program</h3>



<p><strong><em>Step 1. Get the article links in the RSS feed.</em></strong></p>



<p>In this case ABC Technology does have an RSS feed, so we will use it.</p>



<p>To parse the links from the news source, we must first look at the RSS feed and locate where each article link will be.&nbsp; As we can see, each &lt;item&gt; tag holds all of the details for one article in the feed.&nbsp; Inside the &lt;item&gt; tag, the article link sits in the &lt;link&gt; tag.&nbsp; This is where we will extract the links.</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="664" height="383" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-26.png" alt="" class="wp-image-34050" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-26.png 664w, https://blog.finxter.com/wp-content/uploads/2021/08/image-26-300x173.png 300w" sizes="auto, (max-width: 664px) 100vw, 664px" /></figure></div>



<p>We can now write a quick script using requests and <a href="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" target="_blank" rel="noreferrer noopener" title="Parsing XML Using BeautifulSoup In Python">BeautifulSoup </a>to scrape for each of these links.&nbsp; If you have no experience using BeautifulSoup and requests, there are plenty of resources here on <a href="http://finxter.com/" target="_blank" rel="noreferrer noopener">finxter.com</a> to get you started, including many articles about <a href="https://blog.finxter.com/category/web-scraping/" target="_blank" rel="noreferrer noopener">web scraping</a>.</p>



<p>Here is how we will begin:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from bs4 import BeautifulSoup

feed = "https://abcnews.go.com/abcnews/technologyheadlines"

# first make a get request to the RSS feed
response = requests.get(feed)
# collect the contents of the request
webpage = response.content
# create a BeautifulSoup object that we can then parse to extract the links and title
soup = BeautifulSoup(webpage, features='xml')

# here we find every instance of an &lt;item> tag, collect everything inside each tag, and store them all in a list
items = soup.find_all('item')

# extract the article link within each &lt;item> tag and store in a separate list
articles = []
for item in items:
    link = item.find('link').text
    articles.append(link)
</pre>



<p>We first send a GET request to the feed, then take the response content and store it in a BeautifulSoup object (here I use the &#8216;xml&#8217; feature since the RSS feed is written in XML).&nbsp; Then we search for each <code>&lt;item&gt;</code> tag and store the data from each <code>&lt;item&gt;</code> instance in a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to Python Lists">list </a>for us to further parse.&nbsp; We will call this variable <em><code>items</code></em>.</p>



<p>We then loop through each element in <em><code>items</code></em>, extract the link, and store it in a new list, which we will call <em>articles</em>.</p>



<p><strong><em>Step 2. Now, let&#8217;s extract the data in each article.</em></strong></p>



<p>Now that we have all of the article links, we can collect the data we need from each article.&nbsp; We will extract the title, main keywords, summary, and text, and store each in its own variable:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from newspaper import Article

# extract the data from each article, perform sentiment analysis, and then print
for url in articles:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()

    # store the necessary data in variables
    title = article.title
    summary = article.summary
    keywords = article.keywords
    text = article.text
</pre>



<p><strong><em>Step 3. It’s now time to do sentiment analysis.</em></strong></p>



<p>For this section, we are going to utilize the <a href="https://textblob.readthedocs.io/" target="_blank" rel="noreferrer noopener">Textblob</a> and <a href="https://www.nltk.org/" target="_blank" rel="noreferrer noopener">NLTK</a> libraries to process and analyze text.&nbsp; Therefore, before we begin, we must have both libraries <a href="https://blog.finxter.com/how-to-install-pip-on-windows/" target="_blank" rel="noreferrer noopener" title="How To Install pip On Windows?">installed</a>.&nbsp; We can simply run <code><strong>pip install -U textblob</strong></code> to install Textblob.&nbsp;</p>



<p>There is no need to enter a separate command to install NLTK, as installing Textblob automatically installs NLTK along with it.&nbsp; If you wish, however, you can install NLTK alone using <code><strong>pip install nltk</strong></code>.</p>



<p>Textblob is a library that processes text and uses NLP to perform different kinds of analysis, such as sentiment analysis, classifying words into parts of speech, word translation, and more.&nbsp; It needs the <strong><em>Natural Language Toolkit</em></strong> (NLTK) library to run.&nbsp; It conducts sentiment analysis by averaging the scores for different word types in a text and then giving the text a polarity score and a subjectivity score.&nbsp; The polarity score ranges from -1 to 1, with -1 being extremely negative and 1 being extremely positive.&nbsp; The subjectivity score ranges from 0 to 1, with 0 being extremely objective and 1 being extremely subjective.</p>



<p>However, to conduct this analysis, we need to tokenize the text so that Textblob can actually read it correctly.&nbsp; To tokenize simply means to break a text into smaller components, such as words or sentences.&nbsp; The NLTK package will do this for us; however, we first need to download the &#8216;punkt&#8217; package for the tokenization to work:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from textblob import TextBlob
import nltk

nltk.download('punkt')
</pre>



<p>Now that I have explained a little of what is going on behind the scenes, here is what the next section of code will look like (still inside the same &#8216;for&#8217; <a href="https://blog.finxter.com/python-loops/" target="_blank" rel="noreferrer noopener" title="Python Loops">loop</a>):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for url in articles:
    # ... (code from the previous steps) ...
    # run sentiment analysis on the article text
    # create a Textblob object and then get the sentiment values and store them
    text_blob = TextBlob(text)
    polarity = text_blob.polarity
    subjectivity = text_blob.subjectivity
</pre>



<p><strong><em>Step 4. Finally, we can now print out the data.</em></strong></p>



<p>Now that we have all of the data we need, we can print the results:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for url in articles:
    # ... (code from the previous steps) ...
    # now we can print out the data
    print('**************************************')
    print(f'Title: {title}')
    print(f'URL: {url}')
    print(f'Keywords: {keywords}')
    print(f'Polarity: {polarity}')
    print(f'Subjectivity: {subjectivity}')
    print(f'Summary: ')
    print(summary)
    print('**************************************')
</pre>



<p>Here is what a sample of the output will look like:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="664" height="206" src="https://blog.finxter.com/wp-content/uploads/2021/08/image-27.png" alt="" class="wp-image-34051" srcset="https://blog.finxter.com/wp-content/uploads/2021/08/image-27.png 664w, https://blog.finxter.com/wp-content/uploads/2021/08/image-27-300x93.png 300w" sizes="auto, (max-width: 664px) 100vw, 664px" /></figure></div>



<p>If you want to take the code further, the possibilities for analysis are endless.&nbsp; For example, you could write a quick script that selects only articles above a certain subjectivity level, or build a comparison graph of polarity values across different sections of a news site.</p>
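<p>As a quick sketch of the first idea, filtering articles by subjectivity might look like this.&nbsp; The <code>results</code> list and the threshold value are hypothetical, standing in for data collected by the loop above:</p>

```python
# Hypothetical results collected by the scraping loop, one dict per article.
results = [
    {"title": "Opinion: The Best Gadget Ever", "polarity": 0.8, "subjectivity": 0.9},
    {"title": "Chip Factory Opens in Ohio", "polarity": 0.1, "subjectivity": 0.2},
    {"title": "Review: A Mixed Bag", "polarity": -0.2, "subjectivity": 0.7},
]

# Keep only articles above a chosen subjectivity threshold.
THRESHOLD = 0.5
subjective_articles = [r for r in results if r["subjectivity"] > THRESHOLD]

for r in subjective_articles:
    print(f"{r['title']}: polarity {r['polarity']}")
```
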



<p>For more information, I encourage you to check out the Newspaper3k <a href="https://newspaper.readthedocs.io/" target="_blank" rel="noreferrer noopener">documentation</a>.&nbsp; There is also an excellent <a href="https://github.com/johnbumgarner/newspaper3_usage_overview" target="_blank" rel="noreferrer noopener">resource</a> on GitHub.</p>



<p>I have also posted the code for both programs on my <a href="https://github.com/finxter/Articles" target="_blank" rel="noreferrer noopener" title="https://github.com/finxter/Articles">GitHub page</a> for you to copy if you wish. You can read my follow-up article here:</p>



<ul class="wp-block-list"><li><strong>Tutorial</strong>: <a href="https://blog.finxter.com/how-to-generate-a-word-cloud-with-newspaper3k-and-python/" target="_blank" rel="noreferrer noopener" title="How to Generate a Word Cloud with Newspaper3k and Python">How to Set Up a Wordcloud with Newspaper3k</a></li></ul>
<p>The post <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/">Newspaper3k – A Python Library For Fast Web Scraping</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
