As you know, an RSS feed helps us stay updated with the latest news, blog posts, and social media posts. You can use this information for your personal or business growth, and you can even discover new job posts from job websites.
How do we subscribe to an RSS feed? There are two options. The first is an RSS feed reader such as Feedly or NewsBlur, which usually requires a paid subscription. The second is to fetch the feed yourself with a Python script.
Are you wondering how to parse an RSS feed with Python? Which web-scraping library is right for this purpose? Is there a dedicated library for it?
Yes, there is a library built specifically for parsing the RSS feed of your favorite website.
In this short article, you will learn about this library and how to use it. You will also learn how to find the RSS feed URL when a website doesn't link to it directly.
Before that, you need to understand what an RSS feed is and its XML elements.
What is an RSS feed?
RSS stands for Really Simple Syndication. An RSS feed is a web feed through which users receive real-time updates on website content, and it is written in XML. Website owners create RSS feeds so that users can receive the latest content: breaking news, new blog posts, freshly published articles, and new job listings. Look for the RSS icon on a website to subscribe. The RSS feed on Upwork is very helpful for your freelancing career, since you receive the newest job posts and can get ahead of competitors when applying. RSS originated at Netscape and was later developed by Dave Winer, culminating in the release of RSS 2.0 in 2002.
RSS XML Elements
The beginning of the XML document declares XML version 1.0 and the character encoding. It also shows the RSS version, which is 2.0.
| Element | Description | Reference |
| --- | --- | --- |
| `<channel>` | The element that identifies the RSS feed. | Image 2, Marker 2. |
| `<title>` | Defines the title of the channel. | Image 2, Marker 3. In this tag, you find the text "Title of RSS feed". |
| `<link>` | Displays the URL of the channel. | Image 2, Marker 5: "http://www.example.com". |
| `<description>` | Describes the channel. | Image 2, Marker 4. The description of the channel is "Describe RSS Feed". |
| `<item>` | The element under which each story or post is created. It in turn contains its own `<title>`, `<link>`, and `<description>` elements. | Image 2, Marker 6 for the item tag. Markers 7, 8, and 9 show the title of the first entry ("First Entry"), its URL ("http://www.example.com/blog/post/1"), and its description ("Describe First Entry"). |
| `</item>` `</channel>` `</rss>` | The closing tags. | Image 2, Markers 10, 11, and 12. |
The Python Feedparser Module
Now that you have learned about the RSS feed and its elements, let us parse a feed using a dedicated Python module: Universal Feed Parser (feedparser).
The feedparser module is used for downloading and parsing syndicated feeds. It supports RSS, Atom, and CDF feeds, but since we are interested in RSS, we will focus on that. feedparser can parse a feed from a remote URL, a local file, or a string.
We will parse the RSS feed from Upwork.com to get the latest job posts in your field. Isn't it wonderful that every time you run this program, you get new jobs to apply for? You can use the same approach for other job websites as well.
Step 1: Install feedparser module
pip install feedparser
You can install feedparser using the above command.
Step 2: Import feedparser module
import feedparser
Import the module by entering the above code.
Step 3: Identify RSS feed URL from Upwork Job website
In this step, you will learn to identify the RSS feed URL from the Upwork website.
Log in to your Upwork account and search for a job you are interested in, say, "web scraper". Click the green icon below the search field, as shown in Image 2.
You will get a menu with RSS and Atom options. Click RSS, as in Image 3, to proceed.
A new window will open displaying an XML document, as shown in Image 4.
Copy the URL of that page; this is the RSS feed URL.
Step 4: Parse Upwork RSS feed URL
# Parse Upwork URL
d = feedparser.parse('https://www.upwork.com/ab/feed/jobs/rss?q=web+scraper&sort=recency&paging=0%3B10&api_params=1&securityToken=b81cd9281c89f630d0c13022476f3bea26d22c5590013ab4f43c4e390c86a52d69ff5876be0a6d7b174b8888dab7e7aaa59cd884c771490d6f4c09b0d3b903b2&userUid=955976492232273920&orgUid=955976492236468225')
You parse the RSS feed URL with the feedparser.parse() method. Insert the copied URL as shown in the above code; the parsed contents of the feed are stored in the variable d.
Step 5: Print details from the Upwork Job website
The d.entries attribute holds the feed's entries as a list. From that list, we can access the title, link, and description of each item.
Let us inspect the elements of the Upwork RSS feed XML page. Refer to Image 6.
As you can see in the XML page above, there are item, title, link, and description elements, marked with black arrows in Image 6. Within the title tag, the job title CDATA[LEAD GENERATION-Upwork] is displayed. The URL of the job post appears within the link tag. Likewise, the description tag holds details of the job, such as "I own staffing company where we match persons for all types of job and in every industry".
Now let's write a loop to get the details of each of the latest jobs.
for n in range(len(d.entries)):
    print(d.entries[n].title)      # code 1
    print(d.entries[n].link)       # code 2
    s = d.entries[n].description   # code 3
    # description (split indices depend on the feed's layout)
    print(s.split('<br />')[0].replace('&nbsp;', ' ').replace('&#039;', "'"))  # code 4
    print(s.split('<br />')[1].replace('&nbsp;', ' ').replace('&#039;', "'"))  # code 5
    # date
    print(s.split('<b>')[-1].replace('</b>:', '').replace('<br />', '').replace('<br /><b>Category', ''))  # code 6
    # hour
    print("Price:")
    print(s.split('</b>')[0].replace(':', ' ').replace('<br /><b>Posted On', '').replace('<br /><b>Category', ''))  # code 7
To print all the items of the job list, you can use a for loop that iterates over the full length of the entries list.
- code 1: prints the title of the job post.
- code 2: prints the URL of the job post.
- code 3: stores the description details of the job in the variable s.
- code 4: prints the first part of the description after splitting on <br />, removing leftover entities such as &nbsp; and &#039;.
- code 5: prints the second part of the description in the same way.
- code 6: prints the date the job was posted, stripping the tags </b>:, <br />, and <br /><b>Category.
- code 7: prints the budget of the job, stripping XML tags as shown in the above code.
If you print the description with print(d.entries[n].description), you will get all the data at once, such as the budget of the job, the date it was posted, and its category, along with XML tags such as <br /> and <br /><b>Category.
Refer to Image 8 below:
This is why codes 4, 5, 6, and 7 are needed to remove the XML tags from the output.
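Chained replace() calls only catch the tags you list explicitly. As an alternative sketch, you can strip every tag with a regular expression and decode HTML entities with the standard library; the sample description string below is a hypothetical stand-in for a real Upwork entry:

```python
import re
from html import unescape

def clean_description(s):
    # Replace every HTML/XML tag with a space, decode entities such as
    # &amp; and &#039;, then collapse repeated whitespace.
    text = re.sub(r'<[^>]+>', ' ', s)
    return ' '.join(unescape(text).split())

# Hypothetical description in the shape Upwork feeds use.
sample = ("Need a scraper.<br /><b>Budget</b>: $100<br />"
          "<b>Posted On</b>: May 05, 2021")
print(clean_description(sample))
# → Need a scraper. Budget : $100 Posted On : May 05, 2021
```

This approach keeps working even if the feed adds tags you did not anticipate.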
You can also use these steps to get RSS feeds from other job, news, and blog websites.
Some websites do not display an RSS icon. In that case, you have to inspect the HTML elements from the browser's developer menu to find the RSS feed URL. On the New York Times website (https://www.nytimes.com/), you can locate the RSS feed by searching for "RSS" in the find bar. Refer to Image 9 for the RSS URL link.
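Many sites also advertise their feeds in the page's &lt;head&gt; with a &lt;link rel="alternate"&gt; element, which you can find programmatically. A sketch using Python's built-in html.parser (in practice you would fetch the page with urllib or requests; here a small inline page stands in for the download):

```python
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collects RSS/Atom feed URLs advertised in a page's <head>."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Feed links look like:
        # <link rel="alternate" type="application/rss+xml" href="...">
        if (tag == 'link' and a.get('rel') == 'alternate'
                and a.get('type') in ('application/rss+xml',
                                      'application/atom+xml')):
            self.feeds.append(a.get('href'))

# Hypothetical page standing in for a fetched HTML document.
page = ('<html><head><link rel="alternate" '
        'type="application/rss+xml" href="/feed.xml"></head></html>')
finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)   # → ['/feed.xml']
```

Any URL this finds can be passed straight to feedparser.parse().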
Wow, readers! Now you can parse an RSS feed using Python's feedparser library. You know exactly which library is right for this purpose, and with a few lines of Python using feedparser, you can pull the RSS feed from almost any website in no time.
Feeds will help your personal or business growth by keeping you updated on the latest information. Suppose you want to become a subject-matter expert (SME) in machine learning and aim to stay current on its latest developments. An RSS feed will be very beneficial to you in this competitive world.
You can develop the program further to receive RSS feed updates in your email inbox. We will show you how to do this in a future article.