Newspaper3k – A Python Library For Fast Web Scraping

Would you like to be able to scrape information from any article without having to write a completely different set of code every time?

In this post, I will show you a Python library which allows you to scrape any article using only a few lines of code.  It’s called Newspaper3k.

Why?

Let’s start by asking why scraping news or blog articles ‘en masse’ is so useful.  Some of the reasons include the following:

  • A business wants to discover trends or find out what people are saying about the company in order to make more informed decisions.
  • Some individual or service wants to collect and aggregate news.
  • For a research project, such as distinguishing real news from fake news, researchers may need to collect a large set of articles.
  • A journalist may want to gather articles that back up the claims or arguments made in their own articles.

In today’s world, there is an overwhelming amount of news available on the internet. Therefore, if you have to scrape articles, it is important to know what information to look for, where to find it, and how to extract it without wasting time. You do not need to be a professional to understand this. We all deal with content from various sources in our daily lives and we can intuit very well what we need, what we don’t need, and what attracts our curiosity enough that we may want to explore further. How much time we would waste if we did not know how to sift through all of this information!

However, if you must program a web scraper, it can be a drag to search the HTML or CSS every time and write a new set of code for every site you need to scrape. The task is made even more difficult if the content is dynamically loaded. Wouldn’t it be much easier if you could scrape all the information you need from any article using the same couple of lines of code?

It is here where the power of Python shines again. With the Newspaper3k library, you can extract article data for almost any news service or blog with only the same few lines of code.

What is Newspaper3k?

Newspaper3k is a Python library used for scraping web articles. It uses the requests library to download pages and lxml to parse them, and it also has BeautifulSoup as a dependency. Newspaper3k is not only able to scrape the entire article text for you, but can also scrape other kinds of data such as the publish date, author(s), URL, images, and video, to name a few. If you simply wish to know what the article is about without having to read the whole thing, Newspaper3k can also produce a summary of the article.

After you extract the data, it can be saved in different formats such as CSV and JSON, or loaded into a pandas DataFrame.  Newspaper3k also works in over 30 languages.
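As a minimal sketch of the saving step, the snippet below writes one scraped record to JSON and CSV using only the standard library. The `record` dictionary, its values, and the helper names `save_json` and `save_csv` are hypothetical stand-ins for the fields Newspaper3k returns; they are not part of the library.

```python
import csv
import json
import os
import tempfile

# Hypothetical record; in practice these values would come from an Article object.
record = {
    "title": "Example headline",
    "url": "https://example.com/story",
    "authors": ["Jane Doe", "John Roe"],
    "publish_date": "2021-02-02",
}


def save_json(records, path):
    """Write a list of article dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def save_csv(records, path):
    """Write a list of article dictionaries to a CSV file, one row per article."""
    fieldnames = ["title", "url", "authors", "publish_date"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for rec in records:
            row = dict(rec)
            # CSV cells hold plain text, so join the author list into one string
            row["authors"] = "; ".join(rec["authors"])
            writer.writerow(row)


# Write both files into a temporary directory
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "articles.json")
csv_path = os.path.join(out_dir, "articles.csv")
save_json([record], json_path)
save_csv([record], csv_path)
```

One thing to watch for: Newspaper3k returns the publish date as a datetime object (or None), so it needs to be converted to a string, for example with str(), before it is passed to json.dump.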

The Newspaper3k Python library can also perform more advanced tasks such as discovering RSS feeds, scraping article URLs from a main news source, and even multi-threaded extraction if you have to scrape more than one article but cannot afford to bombard a website with too many requests.

I will now show you two sample demonstrations using Newspaper3k.  The first is a very simple article scraper.  In the second demonstration, I will show you how Newspaper3k allows you to do speedy sentiment analysis on news articles.

A Simple Article Scraper Using Newspaper3k

Here I will show you how you can scrape a single article in only a couple of lines of code.

Before we can use Newspaper3k, we must install the package:

pip3 install Newspaper3k

Now let’s write the code.  We will choose this article as our example to scrape:

5 freelance jobs where you can earn $100,000 or more during the pandemic

Let’s first extract the information and then store the data from the parsed article object in appropriately named variables:

from newspaper import Article

# create an article object
article = Article('https://www.cnbc.com/2021/02/02/5-freelance-jobs-where-you-can-earn-100000-or-more-during-pandemic.html')
article.download()
article.parse()
article.nlp()

title = article.title
link = article.url
authors = article.authors
date = article.publish_date
image = article.top_image
summary = article.summary
text = article.text

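We first need to import the Article class from the Newspaper3k library and then we can extract the information.  The order shown is necessary: the article must be downloaded before it can be parsed, and parsed before nlp() is called.  The nlp() method uses Natural Language Processing (NLP) to extract the keywords from the article and to summarize it.  If nlp() complains about a missing NLTK resource, run nltk.download('punkt') first, as shown in the sentiment analysis section later in this post.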

Now that we have the information stored, we can print out our data:

print('**********************************')
print(f'Title: {title}')
print(f'Link: {link}')
# join all authors; this also avoids an IndexError when no author was found
print(f'Authors: {", ".join(authors)}')
print(f'Publish Date: {date}')
print(f'Top Image: {image}')
print(f'Summary: ')
print(summary)
print('**********************************')

And the output:

Not too bad for only a couple of lines, don’t you think?

An Article Sentiment Analysis Program With Newspaper3k

Now I will show you a more expanded demonstration in which we will collect articles from a news source and then print out a summary of each article with its corresponding link and sentiment scores.  The sentiment scores will display the polarity and subjectivity scores for each article.

Let’s say we are doing a sentiment analysis of articles from a particular website.  In this case, we will select ABC Technology News.  We first need to find a way to gather a collection of articles from the news site for us to scrape.

A very easy way to collect article links from a news source is to get its RSS feed if it is available. 

What Is an RSS Feed and Why Is It Useful to Scrape?

RSS stands for ‘Really Simple Syndication.’  These feeds allow the content from a website to be shared and distributed to other services much more easily.  Users can stream content from any news source into their content aggregator service (such as Flipboard).  On the other hand, news sources can use RSS to broaden the reach of their content delivery and potentially attract more readers.  RSS feeds are often included in email content delivery services as well.

RSS feeds for web scraping are incredibly useful for two reasons.  First, the article links are organized and formatted in such a way that they are very easy to find and extract in comparison to a regular website. The second reason is that almost all RSS feeds have the same standard format.  Therefore the same code can often be used if you wish to extract article links from more than one RSS feed. 

It must be said, scraping RSS feeds is no different than scraping regular websites.  Make sure you are able to legally scrape the data from an RSS feed before going ahead and doing so.  Some news sources have limitations on what you can do with RSS data.  Therefore, before you decide to scrape a feed make sure to go to the news site and check to see if they have any RSS policies.  Once you believe it is okay to scrape the RSS feed make sure to follow proper scraping practices such as not bombarding the site with too many requests and respecting the Terms and Conditions. 
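One way to put these practices into code is sketched below, using only the standard library. It checks candidate URLs against robots.txt rules and spaces out requests. The robots.txt content, the example.com URLs, and the helper names `build_parser`, `allowed_urls`, and `polite_iter` are all hypothetical. Note that robots.txt is only one signal; it does not replace reading a site’s RSS policy or Terms and Conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file is served from the site's root
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def polite_iter(urls, delay_seconds=1.0):
    """Yield URLs one at a time, pausing between them to limit the request rate."""
    for i, u in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        yield u


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/feed/technology",
    "https://example.com/private/drafts",
]
permitted = allowed_urls(parser, candidates)
```

In a real scraper, the loop that downloads each article could iterate over polite_iter(permitted) so that every request is separated by a short pause.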

Coding the Program

Step 1. Get the article links in the RSS feed.

In this case ABC Technology does have an RSS feed, so we will use it.

To parse the links from the news source we must first look at the RSS feed and locate where each article link will be.  As we see, each <item> tag has all of the details for each article in the feed.  If we look under the <item> tag, we can see the article link under the <link> tag.  This is where we will extract the links.
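To make that layout concrete, here is a small offline sketch that parses a made-up two-item feed. The feed content and the `extract_links` helper are hypothetical, and the sketch uses the standard library’s xml.etree.ElementTree purely for illustration; the actual program in this post uses requests and BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed that mimics the usual RSS 2.0 layout
SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Example Technology News</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first-story</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second-story</link>
    </item>
  </channel>
</rss>
"""


def extract_links(feed_xml):
    """Return the text of the <link> tag inside every <item> tag."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        # skip items that have no link or an empty one
        if link and link.strip():
            links.append(link.strip())
    return links


links = extract_links(SAMPLE_FEED)
```

Because the channel itself has no <link> here and the function only looks inside <item> tags, only the article links are collected.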

We can now write a quick script using requests and BeautifulSoup to scrape for each of these links.  If you have no experience using BeautifulSoup and requests, there are plenty of resources here on finxter.com to get you started, including many articles about web scraping.

Here is how we will begin:

import requests
from bs4 import BeautifulSoup

feed = "https://abcnews.go.com/abcnews/technologyheadlines"

# first make a get request to the RSS feed
response = requests.get(feed)
# collect the contents of the request
webpage = response.content
# create a BeautifulSoup object that we can then parse to extract the links and title
soup = BeautifulSoup(webpage, features='xml')

# here we find every instance of an <item> tag, collect everything inside each tag, and store them all in a list
items = soup.find_all('item')

# extract the article link within each <item> tag and store in a separate list
articles = []
for item in items:
    link = item.find('link').text
    articles.append(link)

We first send a GET request to the feed, then take the content of the response and store it in a BeautifulSoup object (here I use the ‘xml’ feature since the RSS feed is written in XML).  Then we search for each <item> tag and store the data from each <item> instance in a list for us to parse further.  We will call this variable items.

We then loop through each element in items, take the link out, and store it in a new list which we will call articles.

Step 2. Now, let’s extract the data in each article.

Now that we have all of the article links, we can collect the data we need from each article.  We will extract the title, main keywords, summary, and text and store each in its own variable:

from newspaper import Article

# extract the data from each article, perform sentiment analysis, and then print
for url in articles:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()

    # store the necessary data in variables
    title = article.title
    summary = article.summary
    keywords = article.keywords
    text = article.text

Step 3. It’s now time to do sentiment analysis.

For this section, we are going to utilize the TextBlob and NLTK libraries to process and analyze text.  Therefore, before we begin we must install both of the libraries.  We can simply run pip install -U textblob to install TextBlob.

There is no need to enter a separate command to install NLTK, as installing TextBlob will automatically install NLTK along with it.  If you wish, however, you can install NLTK alone using pip install nltk.

TextBlob is a library that processes text and uses NLP to perform different kinds of analysis, such as sentiment analysis, classifying words into parts-of-speech, word translation, and more.  It needs the Natural Language Toolkit (NLTK) library to run.  It conducts sentiment analysis by averaging the scores for different word types in a text and then giving the text a polarity score and a subjectivity score.  The polarity score ranges from -1 to 1, with -1 being extremely negative and 1 being extremely positive.  The subjectivity score ranges from 0 to 1, with 0 being extremely objective and 1 being extremely subjective.
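To give a feel for what averaging word scores means for the polarity score, here is a deliberately simplified toy. It is not TextBlob’s implementation: the four-word `TOY_LEXICON`, its scores, and the `toy_polarity` function are invented for illustration, while TextBlob’s own analyzer relies on a much larger lexicon and additional rules.

```python
# Hypothetical word scores in the range -1 (negative) to 1 (positive)
TOY_LEXICON = {
    "great": 0.8,
    "good": 0.7,
    "bad": -0.7,
    "terrible": -1.0,
}


def toy_polarity(text):
    """Average the scores of the words found in the lexicon; 0.0 if none match."""
    words = text.lower().split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


positive = toy_polarity("a great phone with a good camera")
mixed = toy_polarity("good screen but terrible battery")
neutral = toy_polarity("the phone has a screen")
```

Positive and negative words pull the average in opposite directions, which is why a long article with mixed language often ends up with a polarity close to zero.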

However, to conduct this analysis we need to tokenize the text in order for TextBlob to actually read the text correctly.  To tokenize simply means to break a text into smaller components such as words or sentences.  The NLTK package will do this for us; however, we first need to download the ‘punkt’ package so that it can do the tokenization:

from textblob import TextBlob
import nltk

nltk.download('punkt')
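To make the idea of tokenization concrete, here is a deliberately naive sketch built on regular expressions. The helper names `naive_sentences` and `naive_words` are hypothetical and this is not how ‘punkt’ works internally; it only illustrates what splitting a text into sentences and words looks like.

```python
import re


def naive_sentences(text):
    """Split the text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def naive_words(sentence):
    """Return runs of letters, digits and apostrophes as word tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


sample = "Scraping is fun. It saves time! Does it scale?"
sentences = naive_sentences(sample)
words = naive_words(sentences[0])
```

A rule this simple breaks on abbreviations such as "Dr." or "U.S.", which is one reason to rely on a trained tokenizer like ‘punkt’ instead.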

Now that I have explained a little of what is going on behind the scenes, here is what the next section of code will look like (still in the ‘for’ loop):

for url in articles:
    # ... (code from Step 2 goes here) ...
    # run sentiment analysis on the article text
    # create a TextBlob object and then get the sentiment values and store them
    text_blob = TextBlob(text)
    polarity = text_blob.polarity
    subjectivity = text_blob.subjectivity

Step 4. Finally, we can print out the data.

Now that we have all of the data we need, we can print the results:

for url in articles:
    # ... (code from Steps 2 and 3 goes here) ...
    # now we can print out the data
    print('**************************************')
    print(f'Title: {title}')
    print(f'URL: {url}')
    print(f'Keywords: {keywords}')
    print(f'Polarity: {polarity}')
    print(f'Subjectivity: {subjectivity}')
    print(f'Summary: ')
    print(summary)
    print('**************************************')

Here is what a sample of the output will look like:

If you want to take the code further and do more analysis, there are many possibilities.  For example, you can write a quick script to select only articles above a certain subjectivity level, or you can make a comparison graph of polarity values from different sections in a news site.
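As a rough sketch of the first idea, the snippet below filters a list of results by subjectivity and averages the polarity of what remains. The `results` list and its values are invented for illustration, and the helper names `above_subjectivity` and `mean_polarity` are hypothetical; in the program above, each entry would be built inside the loop instead of being printed.

```python
# Hypothetical results; real entries would be collected inside the article loop
results = [
    {"title": "Story A", "polarity": 0.10, "subjectivity": 0.35},
    {"title": "Story B", "polarity": -0.20, "subjectivity": 0.62},
    {"title": "Story C", "polarity": 0.45, "subjectivity": 0.80},
]


def above_subjectivity(rows, threshold):
    """Keep only the articles whose subjectivity is above the threshold."""
    return [r for r in rows if r["subjectivity"] > threshold]


def mean_polarity(rows):
    """Average polarity of the given articles; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(r["polarity"] for r in rows) / len(rows)


selected = above_subjectivity(results, threshold=0.5)
selected_titles = [r["title"] for r in selected]
average = mean_polarity(selected)
```

The same list of dictionaries could also be grouped by news section to produce the comparison graph mentioned above.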

For more information, I encourage you to check out the Newspaper3k documentation.  There is also an excellent resource here on GitHub.

I have also posted the code for both programs on my GitHub page for you to copy if you wish. You can read my follow-up article here: