Though Python's BeautifulSoup module was designed to scrape HTML files, it can also be used to parse XML files.
In today's professional marketplace, it is useful to be able to change an XML file into other formats, specifically dictionaries, CSV, JSON, and DataFrames, according to specific needs.
In this article, we will discuss that process.
Scraping XML with BeautifulSoup
💡 Extensible Markup Language (XML) differs from HTML in that HTML primarily deals with how information is displayed on a webpage, while XML handles how data is stored and transmitted. XML also uses custom tags and is designed to be both human- and machine-readable.
When inspecting a webpage, a declaration at the top of the page denotes what type of file you are viewing.
For an XML file, you may see <?xml version="1.0"?>.
As a side note, "version 1.0" is a little deceiving: several revisions have been made since its inception in 1998, but the name has not changed.
Despite the differences between HTML and XML, BeautifulSoup can parse both because it builds a Python object tree from either one, and the process is similar. For this article, I will be using a sample XML file from w3schools.com.
Import the BeautifulSoup library and requests modules to scrape this file.
# Import needed libraries
from pprint import pprint
from bs4 import BeautifulSoup
import requests
Once these have been imported, request the content of the webpage.
# Request data
webpage = requests.get("https://www.w3schools.com/xml/cd_catalog.xml")
data = webpage.content
pprint(data)
At this point, I like to print just to make sure I am getting what I need. I use the pprint() function to make it more readable.
Next, create a BeautifulSoup object and declare the parser to be used. Because it is an XML file, use an XML parser. The 'xml' parser relies on the lxml package, so lxml must be installed.
# Create a BeautifulSoup object
soup = BeautifulSoup(data, 'xml')
print(soup.prettify())
With that printed, you can see the object tree created by BeautifulSoup. The parent, "<CATALOG>", its child "<CD>", and all of the children of "<CD>" are displayed.
Output of the first CD:
<CATALOG>
 <CD>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>
  <COUNTRY>USA</COUNTRY>
  <COMPANY>Columbia</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>1985</YEAR>
 </CD>
All that is left is to scrape the desired data and display it.
Using the enumerate() and find_all() functions, each occurrence of a tag can be found and its contents placed into a list.
After that, using a for loop, unpack the created lists and create groupings. The .text attribute and the strip() method return only the text and remove the surrounding white space.
Just for readability, print a blank line after each grouping.
# Scrape data
parent = soup.find('CATALOG')
for n, tag in enumerate(parent.find_all('CD')):
    title = [x for x in tag.find_all('TITLE')]
    artist = [x for x in tag.find_all('ARTIST')]
    country = [x for x in tag.find_all('COUNTRY')]
    company = [x for x in tag.find_all('COMPANY')]
    price = [x for x in tag.find_all('PRICE')]
    year = [x for x in tag.find_all('YEAR')]
    # view data
    for item in title:
        print('Title: ', item.text.strip())
    for item in artist:
        print('Artist: ', item.text.strip())
    for item in country:
        print('Country: ', item.text.strip())
    for item in company:
        print('Company: ', item.text.strip())
    for item in price:
        print('Price: ', item.text.strip())
    for item in year:
        print('Year: ', item.text.strip())
    print()
With that, the CDs should be cataloged in this format.
Title: Empire Burlesque
Artist: Bob Dylan
Country: USA
Company: Columbia
Price: 10.90
Year: 1985
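To experiment with the grouping logic without a network request, the sketch below parses a two-record excerpt of the catalog held in a string. As an assumption made for portability, it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, so it is an alternative and not the method used above. The names SAMPLE_XML and format_cd are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two records copied from the sample catalog (an excerpt, not the full file).
SAMPLE_XML = """<?xml version="1.0"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>"""


def format_cd(cd):
    """Return one 'Label: value' line per child element of a <CD>."""
    lines = []
    for child in cd:
        label = child.tag.title()           # 'TITLE' becomes 'Title'
        value = (child.text or "").strip()  # drop surrounding white space
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


root = ET.fromstring(SAMPLE_XML)  # root is the <CATALOG> element
groupings = [format_cd(cd) for cd in root.findall("CD")]

# Separate each grouping with a blank line for readability.
print("\n\n".join(groupings))
```

Looping over the children of each CD avoids naming every tag by hand, which helps if the file later gains a new field.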
XML to Dictionary
Besides lists, dictionaries are a common structure for storing data in Python.
Information is stored in key: value pairs. Those pairs are stored within curly brackets {}.
Example: capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}
The key of the pair is case-sensitive and unique. The value can be any data type and may be duplicated.
Accessing the value of the pair can be done via the key. Since a key cannot be duplicated, finding a value in a large dictionary is easy so long as you know the key. A list of keys can be obtained using the keys() method.
Example: print(capital.keys())
Finding information in a dictionary is quick since you only search for a specific key.
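As a small illustration of these points, the sketch below builds the capital dictionary from the example and looks values up by key. The lowercase lookup is only there to show case sensitivity.

```python
# Keys and values are strings here, so they must be quoted.
capital = {"Pennsylvania": "Harrisburg", "Michigan": "Lansing"}

# Look up a value directly by its key.
print(capital["Michigan"])

# keys() returns a view of every key; list() makes it easy to print.
print(list(capital.keys()))

# Keys are case-sensitive: "michigan" is a different key, so get() returns None.
print(capital.get("michigan"))
```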
Dictionaries are used quite often, if memory usage is not a concern, because of the quick access. For this reason, it is important to know how to convert information gained in an XML file to a dictionary.
There are three basic steps to convert an XML file to a dictionary. The full listing comes first, followed by each step in turn:
import xmltodict
import pprint

with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:
    cd_xml = file.read()

cd_dict = xmltodict.parse(cd_xml)
cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]
pprint.pprint(cd_dict_list)
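First, for the conversion, use the xmltodict package. It is a third-party package and not part of the standard library, so install it beforehand (for example, with pip install xmltodict). Then import it along with any other modules to be used.
import xmltodict
import pprint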
Second, the file needs to be opened, read, and assigned to a variable.
with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:
    cd_xml = file.read()
Third, using xmltodict.parse(), convert the XML file to a dictionary and view it.
cd_dict = xmltodict.parse(cd_xml)
cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]
pprint.pprint(cd_dict_list)
The output of this is a nice clean list of dictionaries. To view all artists, a simple for loop can be used.
for item in cd_dict_list:
    print(item['ARTIST'])
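If xmltodict is not available, a comparable list of dictionaries can be sketched with the standard library. This is an alternative and not the approach used above. The names catalog_xml, cd_records, and uk_titles are illustrative, and the two records are an excerpt of the sample file.

```python
import xml.etree.ElementTree as ET

# A two-record excerpt of the sample catalog, kept inline for the example.
catalog_xml = """<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE><YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY><COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE><YEAR>1988</YEAR>
  </CD>
</CATALOG>"""

root = ET.fromstring(catalog_xml)

# One dictionary per <CD>, keyed by child tag name. Every value is a string.
cd_records = [{child.tag: child.text for child in cd} for cd in root.iter("CD")]

# Look fields up by key, or filter on them.
artists = [cd["ARTIST"] for cd in cd_records]
uk_titles = [cd["TITLE"] for cd in cd_records if cd["COUNTRY"] == "UK"]

print(artists)
print(uk_titles)
```

This comprehension always yields a list, even when the file holds a single CD. To my understanding, xmltodict returns a single dictionary in that case, so it is worth checking its documentation before relying on the list shape.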
XML to JSON
💡 JSON stands for JavaScript Object Notation. These files store data in key:value form like a Python dictionary. JSON files are used primarily to transmit data between web applications and servers.
Converting an XML file to a JSON file requires only a few lines of code.
As always, import the needed libraries and modules.
import json
from pprint import pprint
import xmltodict
Again, you will see the use of xmltodict. Because a dictionary and JSON are so similar, first convert the file to a dictionary and then write it to a JSON file. The json.dumps() function takes the dictionary and returns it as a JSON-formatted string. That string is then written to a JSON file.
with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog example.xml') as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Serialize the dictionary to a JSON string and view it
json_data = json.dumps(data_dict)
pprint(json_data)

# The with statement closes each file automatically, so close() is not needed
with open('data.json', 'w') as json_file:
    json_file.write(json_data)
Output:
('{"CATALOG": {"CD": [{"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan", '
 '"COUNTRY": "USA", "COMPANY": "Columbia", "PRICE": "10.90", "YEAR": "1985"}, '
 '{"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler", "COUNTRY": "UK", '
 '"COMPANY": "CBS Records", "PRICE": "9.90", "YEAR": "1988"}, {"TITLE": '
 '"Greatest Hits", "ARTIST": "Dolly Parton", "COUNTRY": "USA", "COMPANY": '
 '"RCA", "PRICE": "9.90", "YEAR": "1982"}, {"TITLE": "Still got the blues", '
 …)
The data that started as an XML file has now been written to a JSON file called data.json.
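For a quick check of the conversion without touching the file system, the sketch below serializes a hand-typed, two-record dictionary shaped like the xmltodict result and then reads it back. The indent argument and the round trip through json.loads() are additions for illustration and are not part of the code above.

```python
import json

# A two-record excerpt shaped like the dictionary produced from the XML file.
data_dict = {
    "CATALOG": {
        "CD": [
            {"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan",
             "COUNTRY": "USA", "COMPANY": "Columbia",
             "PRICE": "10.90", "YEAR": "1985"},
            {"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler",
             "COUNTRY": "UK", "COMPANY": "CBS Records",
             "PRICE": "9.90", "YEAR": "1988"},
        ]
    }
}

# indent=2 spreads the output over several lines; the default is one long line.
json_text = json.dumps(data_dict, indent=2)
print(json_text)

# json.loads() reverses the conversion, which makes a simple sanity check.
restored = json.loads(json_text)
print(restored == data_dict)
```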
XML to DataFrame
There are a couple of ways to achieve this goal.
Using Python's ElementTree is one. I am, however, partial to Pandas.
💡 Pandas is a great module for working with data, and it simplifies many daily tasks of a programmer and data scientist. I strongly suggest becoming familiar with this module.
For this code, use a combination of BeautifulSoup and Pandas.
Import the necessary libraries.
import pandas as pd
from bs4 import BeautifulSoup
To display the output fully, display values may need to be altered. I am going to set the max number of columns as well as the display width. This will overwrite any default settings that may be in place.
Without doing this, you may find some of your columns are replaced by "…" or the columns may be displayed under your first couple of columns.
# set max columns and display width
pd.set_option("display.max_columns", 10)
pd.set_option("display.width", 1000)
The width and columns can be changed according to your needs. With that completed, open and read the XML file. Store the contents in a variable.
# Open and read the XML file; the with statement closes it afterward
with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r') as xml_file:
    contents = xml_file.read()
Next, create a BeautifulSoup object.
# BeautifulSoup object
soup = BeautifulSoup(contents, 'xml')
The next step is to extract the data and assign it to a variable.
# Extract data and assign it to a variable
title = soup.find_all("TITLE")
artist = soup.find_all("ARTIST")
country = soup.find_all("COUNTRY")
company = soup.find_all("COMPANY")
price = soup.find_all("PRICE")
year = soup.find_all("YEAR")
Now a for loop can be used to extract the text.
Using the length of one of the variables as the loop bound means you do not need to know how many items are cataloged, even if data is added or removed later.
Place the text in an empty list.
# Text
cd_info = []
for i in range(0, len(title)):
    rows = [title[i].get_text(), artist[i].get_text(),
            country[i].get_text(), company[i].get_text(),
            price[i].get_text(), year[i].get_text()]
    cd_info.append(rows)
Lastly, create the data frame and name the columns.
# Create a dataframe with Pandas and print
# Column names follow the order of the values in each row
df = pd.DataFrame(cd_info, columns=['Title', 'Artist', 'Country', 'Company', 'Price', 'Year'])
print(df)
Output:
              Title        Artist  Country      Company  Price  Year
0  Empire Burlesque     Bob Dylan      USA     Columbia  10.90  1985
1   Hide your heart  Bonnie Tyler       UK  CBS Records   9.90  1988
2     Greatest Hits  Dolly Parton      USA          RCA   9.90  1982
A nice, neat table containing each CD's data has been created.
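Because every value taken from the XML is text, the Price and Year columns hold strings until they are converted. The sketch below shows one way to convert them with pd.to_numeric(), assuming a hand-typed three-row excerpt in place of the scraped cd_info list. The conversion step is an addition for illustration and is not part of the code above.

```python
import pandas as pd

# A three-row excerpt standing in for the scraped cd_info list.
cd_info = [
    ["Empire Burlesque", "Bob Dylan", "USA", "Columbia", "10.90", "1985"],
    ["Hide your heart", "Bonnie Tyler", "UK", "CBS Records", "9.90", "1988"],
    ["Greatest Hits", "Dolly Parton", "USA", "RCA", "9.90", "1982"],
]
columns = ["Title", "Artist", "Country", "Company", "Price", "Year"]
df = pd.DataFrame(cd_info, columns=columns)

# Convert the text columns to numbers so they can be sorted and summed.
df["Price"] = pd.to_numeric(df["Price"])
df["Year"] = pd.to_numeric(df["Year"])

print(df.dtypes)

# Numeric columns sort by value, not alphabetically.
oldest_first = df.sort_values("Year")
print(oldest_first[["Title", "Year"]])

total_price = df["Price"].sum()
print(round(total_price, 2))
```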
XML to CSV
💡 A CSV (comma-separated values) file contains plain text that is easily readable by the user. It is used to exchange tabular data between applications and can be opened in any text editor or in a spreadsheet program such as Microsoft Excel.
Each line represents a new row of data, and each comma marks a new column. Using the code from above, the XML file can be converted to a CSV file with one new line.
# index=False leaves out the DataFrame's row numbers
df.to_csv('cd catalog.csv', index=False)
With that, go to your files and search the C: drive for 'cd catalog.csv' (the file is saved in the script's working directory). It will open in the default program used for spreadsheets, in this case Microsoft Excel.
Title | Artist | Country | Company | Price | Year |
Empire Burlesque | Bob Dylan | USA | Columbia | 10.90 | 1985 |
Hide your heart | Bonnie Tyler | UK | CBS Records | 9.90 | 1988 |
Greatest Hits | Dolly Parton | USA | RCA | 9.90 | 1982 |
Still got the blues | Gary Moore | UK | Virgin records | 10.20 | 1990 |
Eros | Eros Ramazzotti | EU | BMG | 9.90 | 1997 |
One night only | Bee Gees | UK | Polydor | 10.90 | 1998 |
Sylvias Mother | Dr.Hook | UK | CBS | 8.10 | 1973 |
Maggie May | Rod Stewart | UK | Pickwick | 8.50 | 1990 |
Romanza | Andrea Bocelli | EU | Polydor | 10.80 | 1996 |
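To see the comma-separated layout itself instead of a spreadsheet's rendering of it, the sketch below writes two records with the standard library's csv module into an in-memory buffer and reads them back. It is an alternative to DataFrame.to_csv(), not the method used above, and the rows list is a hand-typed excerpt.

```python
import csv
import io

fieldnames = ["Title", "Artist", "Country", "Company", "Price", "Year"]
rows = [
    {"Title": "Empire Burlesque", "Artist": "Bob Dylan", "Country": "USA",
     "Company": "Columbia", "Price": "10.90", "Year": "1985"},
    {"Title": "Hide your heart", "Artist": "Bonnie Tyler", "Country": "UK",
     "Company": "CBS Records", "Price": "9.90", "Year": "1988"},
]

# Write to an in-memory buffer; pass an open file instead to write to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()     # first line: the column names
writer.writerows(rows)   # one line per record

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields one dictionary per data row.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[1]["Artist"])
```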