<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BeautifulSoup Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/category/beautifulsoup/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/category/beautifulsoup/</link>
	<description></description>
	<lastBuildDate>Thu, 19 Oct 2023 08:56:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>BeautifulSoup Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/category/beautifulsoup/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</title>
		<link>https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Thu, 19 Oct 2023 08:56:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652308</guid>

					<description><![CDATA[<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from MindBodyOnline.com or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot! 🕷️ Web scraping, a technique used to extract data from ... <a title="How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com" class="read-more" href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/" aria-label="Read more about How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from <a href="http://MindBodyOnline.com">MindBodyOnline.com</a> or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="598" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png" alt="" class="wp-image-1652328" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-300x175.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-768x449.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1536x897.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156.png 1585w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f577.png" alt="🕷" class="wp-smiley" style="height: 1em; max-height: 1em;" /></strong> <strong>Web scraping</strong>, a technique used to extract data from websites, has become an essential skill on Upwork &#8212; it&#8217;s one of the most sought-after skills on most <a href="https://blog.finxter.com/best-python-freelancer-platforms/">freelancing platforms</a>. Most beginners start with the <strong><a href="https://blog.finxter.com/installing-beautiful-soup/">Beautiful Soup</a></strong> and <strong><a href="https://blog.finxter.com/python-requests-library-2/">Requests</a></strong> modules in Python. While these tools are powerful, they&#8217;re not always sufficient for every site. Enter tools like <strong><a href="https://blog.finxter.com/how-to-open-a-url-in-python-selenium/">Selenium</a></strong>, which, while powerful, can sometimes be overkill or inefficient. </p>



<p>So, where should one start? The answer is simple: Always check for an API first.</p>



<h3 class="wp-block-heading">Why Start with APIs?</h3>



<p>An <strong>Application Programming Interface (API)</strong> allows two software applications to communicate with each other. Many websites offer APIs to provide structured access to their data, making it easier and more efficient than scraping the web pages directly.</p>



<p>Benefits of using APIs:</p>



<ul class="wp-block-list">
<li><strong>Efficiency</strong>: Extracting data from APIs is often faster and less resource-intensive than scraping web pages.</li>



<li><strong>Reliability</strong>: APIs are designed to be accessed programmatically, reducing the chances of breaking changes.</li>



<li><strong>Ethical considerations</strong>: Accessing data via an API is often more in line with a website&#8217;s terms of service than scraping their pages directly.</li>
</ul>



<p>MindBodyOnline provides a dedicated API tailored for developers: <a href="https://developers.mindbodyonline.com/ui/documentation/public-api#/http/mindbody-public-api-v6-0/introduction/getting-started">MindBody API</a>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="536" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png" alt="" class="wp-image-1652310" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-300x157.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-768x402.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140.png 1426w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>If you&#8217;re aiming to craft an app utilizing their dataset, this API is your ideal resource. It boasts a plethora of endpoints, enabling swift data retrieval and ensuring seamless interaction between your application and their servers.</p>



<p><strong>But what if you aren’t creating an application and just need to scrape data once for research?</strong> MindBodyOnline also retrieves the data for its own website via an API. JavaScript on the page requests the data needed to populate the site, and we can make requests to that same API ourselves.</p>
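<p>As a sketch of what that looks like in code (assuming the <code>requests</code> module; the endpoint URL is the one that shows up in the browser's network tab later in this article, and the payload values here are purely illustrative), we can build such a request without even sending it:</p>

```python
import requests

# Sketch: construct (but don't send) the kind of request the site's own
# JavaScript makes. The endpoint comes from the browser's network tab;
# the coordinate values are illustrative.
payload = '{"page": {"size": 50, "number": 1}, "filter": {"latitude": 40.71, "longitude": -74.01}}'

req = requests.Request(
    method="GET",
    url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
    data=payload,
    headers={"content-type": "application/json"},
)
prepared = req.prepare()
print(prepared.method, prepared.url)
```

Calling <code>requests.Session().send(prepared)</code> would actually dispatch it; preparing first is a handy way to inspect exactly what will go over the wire.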



<h2 class="wp-block-heading">How to check if a website is rendered with Javascript</h2>



<p>The site we will be scraping is <a href="https://www.mindbodyonline.com/explore">MindBodyOnline</a>. </p>



<p>If a website is rendered with <a href="https://blog.finxter.com/javascript-data-types/">JavaScript</a>, we should check the network traffic and see if we can find a request that returns the data we see on the page. This can be done quickly with developer tools. In Chrome, you can bring up developer tools by pressing <code>Ctrl+Shift+I</code>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png" alt="" class="wp-image-1652313" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-768x531.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1536x1063.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143.png 1620w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>From here, we can turn off JavaScript, then refresh the page and see if there are any changes. To turn off JavaScript, first press <code>Ctrl+Shift+P</code> to bring up the command palette. Start typing “javascript” to filter the options, then click “Disable JavaScript”.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="506" height="117" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png" alt="" class="wp-image-1652311" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png 506w, https://blog.finxter.com/wp-content/uploads/2023/10/image-141-300x69.png 300w" sizes="auto, (max-width: 506px) 100vw, 506px" /></figure>
</div>


<p>Then refresh the page. As we can see, they use JavaScript for all the data.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="444" height="95" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png" alt="" class="wp-image-1652312" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png 444w, https://blog.finxter.com/wp-content/uploads/2023/10/image-142-300x64.png 300w" sizes="auto, (max-width: 444px) 100vw, 444px" /></figure>
</div>


<p>Before we can continue, we need to turn JavaScript back on. Bring up the command palette again, filter for “javascript”, and click “Enable JavaScript”. Then refresh the page again.</p>



<h2 class="wp-block-heading">Check the JavaScript Requests</h2>



<p>Select the Network tab in developer tools.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png" alt="" class="wp-image-1652314" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-144-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>Make sure <code>Fetch/XHR</code> and <code>Preserve log</code> are selected. Next, we can click the circle with the line through it to clear the output. Then perform a search to see what requests were performed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="192" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png" alt="" class="wp-image-1652315" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-145-300x92.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can then check each item in the output to see if it returns useful information.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="601" height="255" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png" alt="" class="wp-image-1652316" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png 601w, https://blog.finxter.com/wp-content/uploads/2023/10/image-146-300x127.png 300w" sizes="auto, (max-width: 601px) 100vw, 601px" /></figure>
</div>


<p>We are primarily interested in the response to the request. We are looking for JSON data that matches the data shown on the page. In this case, it is the <code>locations</code> request that contains the data we seek.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="395" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png" alt="" class="wp-image-1652317" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-147-300x190.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can also see that there is a payload required. When we make our requests, we must provide this payload in the request body. There are three items of interest here. The latitude and longitude allow us to control the city we are pulling data for, and we also need to provide a page number.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="562" height="111" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png" alt="" class="wp-image-1652318" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png 562w, https://blog.finxter.com/wp-content/uploads/2023/10/image-148-300x59.png 300w" sizes="auto, (max-width: 562px) 100vw, 562px" /></figure>
</div>


<p>MindBody uses pagination, so a relatively small amount of data is pulled with each request. A large city like New York can have over a hundred pages.</p>



<p>We go to the headers tab to copy the request URL.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="120" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png" alt="" class="wp-image-1652319" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-149-300x58.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<h2 class="wp-block-heading">Using Insomnia to Generate Request Headers</h2>



<p>From here, we can use a tool to help us with the request syntax. </p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Insomnia</strong> is a powerful open-source API client tool for testing and debugging APIs. It provides a user-friendly interface to send requests to web services and view responses. With Insomnia, you can define various request types, from simple HTTP GET requests to complex JSON, GraphQL, or even multipart file uploads. You can download the insomnia desktop app <a href="https://insomnia.rest/download">here</a>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="601" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png" alt="" class="wp-image-1652320" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-300x176.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-768x451.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150.png 1342w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Using Insomnia is quite simple. Just paste in the API URL and click <code>Send</code>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png" alt="" class="wp-image-1652321" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-151-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can check the preview tab to make sure it returns the data we want:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="493" height="506" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png" alt="" class="wp-image-1652322" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png 493w, https://blog.finxter.com/wp-content/uploads/2023/10/image-152-292x300.png 292w" sizes="auto, (max-width: 493px) 100vw, 493px" /></figure>
</div>


<p>This is where it gets good. If we click the dropdown on the Send button, one of the options is “Generate client code”. How convenient! Select Python as the language and Requests as the library, then click “Copy to Clipboard” and you’re off to the races.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="397" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png" alt="" class="wp-image-1652323" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-153-300x191.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<h2 class="wp-block-heading">A Simple Scrapy Spider</h2>



<p>The code can be found on <a href="https://github.com/PythonCB/Scrape_MindBodyOnline">Github</a>. I will walk through the code below, starting with the imports.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import scrapy
import json
import pandas as pd
from scrapy.crawler import CrawlerProcess
import os
</pre>



<p><a href="https://blog.finxter.com/python-scrapy-scraping-dynamic-website-with-api-generated-content/">Scrapy</a> is a good option because it can handle multiple requests at the same time with <a href="https://blog.finxter.com/python-async-for-mastering-asynchronous-iteration-in-python/">asynchronous</a> processing. Scapy has a lot of bells and whistles and a fair bit of a learning curve, but it’s also possible to avoid a lot of the extra complexity. The goal here was to place all the code in one simple script.</p>



<p>First, we have to create a spider class. The class is pretty large so I’ll display it in chunks.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">class MindbodySpider(scrapy.Spider):
    name = 'mindbody_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 5,
        'DOWNLOAD_DELAY': 3.2,
    }
</pre>



<p>Our class inherits from one of the Scrapy <code>Spider</code> classes, with <code>scrapy.Spider</code> being the simplest. In the custom settings, with <code>CONCURRENT_REQUESTS</code> set to <code>5</code>, Scrapy will process five requests at a time, starting a new one as soon as one finishes. </p>



<p>We use a <code>DOWNLOAD_DELAY</code> so we don’t bombard the website with too many requests at once.</p>



<p>Next, we need a starting template for the payload:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">starting_payload = '''{
                          "sort":"-_score,distance",
                          "page":{"size":50,"number":&lt;&lt;num>>},
                          "filter":{"categories":"any",
                                    "latitude":&lt;&lt;lat>>,
                                    "longitude":&lt;&lt;lon>>,
                                    "categoryTypes":"any"}
                       }'''
</pre>
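<p>As a quick sanity check (a standalone sketch, separate from the spider), we can fill the placeholders with <code>str.replace()</code> and confirm the result parses as valid JSON:</p>

```python
import json

# Same placeholder scheme as the spider's template; values here are illustrative.
starting_payload = '''{
    "sort": "-_score,distance",
    "page": {"size": 50, "number": <<pg>>},
    "filter": {"categories": "any",
               "latitude": <<lat>>,
               "longitude": <<lon>>,
               "categoryTypes": "any"}
}'''

payload = (starting_payload
           .replace('<<pg>>', '1')
           .replace('<<lat>>', '40.7128')
           .replace('<<lon>>', '-74.0060'))

parsed = json.loads(payload)  # raises ValueError if the template was filled incorrectly
print(parsed['page']['number'])  # → 1
```

Validating the filled template locally like this catches placeholder typos before any request is sent.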



<p>Next, we have the headers that Insomnia so helpfully provided for us.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">headers = {
        "cookie": "__cf_bm=zdIhLHXKd2OAveBChKORUMdydUFVzC2Ma51sQxv.UJ0-1694646164-0-Abmbwcj2wNw%2FpityY4DWRWy%2FftBkjTO0vQ3tZ0gwU0P5bsTqcasf2XZlBwL%2BUaevGaH%2BTDzZOJPBXbWYwgsXkJc%3D",
        "authority": "prod-mkt-gateway.mindbody.io",
        "accept": "application/vnd.api+json",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "origin": "https://www.mindbodyonline.com",
        "sec-ch-ua": "^\^Not/A",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "^\^Windows^^",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "cross-site",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "x-mb-app-build": "2023-08-02T13:33:44.200Z",
        "x-mb-app-name": "mindbody.io",
        "x-mb-app-version": "e5d1fad6",
        "x-mb-user-session-id": "oeu1688920580338r0.2065068094427127"
    }
</pre>



<p>Then comes a very simple <code>__init__</code> method:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def __init__(self):
        scrapy.Spider.__init__(self)
        self.city_count = 0
</pre>



<p>The <code>start_requests</code> method loops through each city. This is the main loop that creates the first request for each city.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def start_requests(self):
        cities = pd.read_csv('uscities.csv')

        for idx, city in cities.iterrows():
            lat, lon = city.lat, city.lng
            self.logger.info(f"{city.city}, {city.state_id} started")

            # Start with the first page for each city
            payload = self.starting_payload.replace('&lt;&lt;pg>>', '1').replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))

            yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': city.city, 'page_num': 1, 'lat': lat, 'lon': lon, 'state': city.state_id},
                callback=self.parse
            )
</pre>



<p>The code is pretty simple. We <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/">create a DataFrame</a> from a <a href="https://blog.finxter.com/read-a-csv-file-to-a-pandas-dataframe/">CSV file</a> with city information and then loop through it with the <code>iterrows</code> method. We create the payload for the request using the template and the lat/long values from the DataFrame. The page is set to 1 each time. We will handle additional pages later.</p>



<p>Finally, we yield a <code>scrapy.Request</code> object. We use <code><a href="https://blog.finxter.com/yield-keyword-in-python-a-simple-illustrated-guide/">yield</a></code> instead of <code><a href="https://blog.finxter.com/python-return/">return</a></code> so we can handle <a href="https://blog.finxter.com/python-async-requests-getting-urls-concurrently-via-https/">multiple requests concurrently</a>. The body is our modified payload, and we use the same header for each request.</p>
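<p>The difference is easy to see with a plain Python generator (a minimal illustration, unrelated to Scrapy itself):</p>

```python
def make_requests():
    # Each yield hands one item back to the caller; execution pauses here
    # and resumes only when the next item is requested.
    for page in range(1, 4):
        yield f"request for page {page}"

gen = make_requests()
print(next(gen))  # → request for page 1
print(next(gen))  # → request for page 2
```

Scrapy consumes <code>start_requests</code> the same way, pulling requests off the generator one at a time as download slots free up, rather than building the whole list upfront.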



<p>What do we do with the response returned from the request? As soon as the response is returned, it is fed into the <code>parse</code> method thanks to the <code>callback</code> parameter:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">callback=self.parse</pre>



<p>The <code>meta</code> parameter gives us a way to pass information to the <code>callback</code> function. We need the <code>page_num</code>, <code>lat</code>, and <code>lon</code> values for the next request. <code>city_name</code> and <code>state</code> are used for screen output.</p>



<p>The list of cities was downloaded from the web. Many different sources will work, as long as they contain latitude and longitude values.</p>
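<p>For illustration, a tiny stand-in for <code>uscities.csv</code> with the columns the spider expects (<code>city</code>, <code>state_id</code>, <code>lat</code>, <code>lng</code>) might look like this:</p>

```python
import pandas as pd

# Hypothetical two-row stand-in for uscities.csv.
cities = pd.DataFrame({
    'city': ['New York', 'Los Angeles'],
    'state_id': ['NY', 'CA'],
    'lat': [40.7128, 34.0522],
    'lng': [-74.0060, -118.2437],
})

# Same iteration pattern as start_requests above.
for idx, city in cities.iterrows():
    print(f"{city.city}, {city.state_id}: ({city.lat}, {city.lng})")
```

Any city dataset with these four columns (or renamed to them) will drop straight into the spider.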



<h2 class="wp-block-heading">Parsing the Response</h2>



<p>The <code>parse</code> method is a little long, but not too complicated. </p>



<p>Getting the data and saving it is very easy. We just convert <code>response.text</code> to a DataFrame and <a href="https://blog.finxter.com/how-to-export-pandas-dataframe-to-csv-example/">save it to a CSV file</a>. If the file already exists, we will append the data and not include a header. Otherwise, we create a new CSV file and include a header.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def parse(self, response):
        data = json.loads(response.text)
        gyms_df = pd.json_normalize(data['data'])

        # Save the dataframe to a CSV
        city_name = response.meta['city_name']
        state = response.meta['state']
        fname = f'{city_name} {state}.csv'.replace(' ', '_')
        csv_path = f'./data/cities2/{fname}'

        # Check if the file exists to determine the write mode
        file_exists = os.path.exists(csv_path)
        write_mode = 'a' if file_exists else 'w'

        gyms_df.to_csv(csv_path,
                       mode=write_mode,
                       index=False,
                       header=not file_exists)
</pre>
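<p>To see the flattening and append logic in isolation, here is a sketch with a made-up two-record response (the field names are hypothetical, not MindBody's actual schema):</p>

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical API response with nested records.
response_text = json.dumps({"data": [
    {"id": "1", "attributes": {"name": "Gym A", "rating": 4.8}},
    {"id": "2", "attributes": {"name": "Gym B", "rating": 4.5}},
]})

data = json.loads(response_text)
gyms_df = pd.json_normalize(data['data'])  # flattens to: id, attributes.name, attributes.rating

# Write the header only on the first save; later pages append below it.
csv_path = os.path.join(tempfile.mkdtemp(), 'New_York_NY.csv')
file_exists = os.path.exists(csv_path)
gyms_df.to_csv(csv_path,
               mode='a' if file_exists else 'w',
               index=False,
               header=not file_exists)
print(pd.read_csv(csv_path).shape)  # → (2, 3)
```

<code>json_normalize</code> turns each nested dictionary into dotted column names, which is why the scraped CSVs end up with columns like <code>attributes.name</code>.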



<h2 class="wp-block-heading">Handling Pagination</h2>



<p>To move on to the next page, we need to create another Scrapy Request. For the payload we use the same latitude and longitude and increment the page number by 1.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">        # Check if there's another page and if so, initiate the request
        next_page_num = response.meta['page_num'] + 1
        if next_page_num &lt;= 150:  # Optional: upper limit
            lat, lon = response.meta['lat'], response.meta['lon']  # Assuming you store lat and lon in meta too

            payload = self.starting_payload.replace('&lt;&lt;pg>>', '1').replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))
</pre>



<h2 class="wp-block-heading">Make the Request for the Next Page</h2>



<p>To finish the <code>parse</code> method, all we have to do is make another request with the new payload.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': response.meta['city_name'], 
                      'page_num': next_page_num, 
                      'lat': lat, 
                      'lon': lon,
                      'state': state},
                callback=self.parse
            )

        self.city_count += 1
        print(response.meta['city_name'], f'complete ({self.city_count})')
        self.logger.info(f"""{response.meta['city_name']}, 
                           {response.meta['state']} is complete""")
</pre>



<h2 class="wp-block-heading">How the Pagination Loop Terminates</h2>



<p>What happens if there are 100 pages for the current city and the code sends a request with <code>page_num = 101</code>? </p>



<p>The request will not return anything, so the callback function won’t get called and the recursive loop for that city will stop. </p>



<p>Then the <code>start_requests</code> loop will move on to the next city.</p>



<h2 class="wp-block-heading">It’s alive! Setting Our Little Spider Loose</h2>



<p>To get our creepy critter crawling, we create a <code>CrawlerProcess</code>. Then tell it to crawl. Then tell it to start. On your mark, get set, CRAWL!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">process = CrawlerProcess()
process.crawl(MindbodySpider)
process.start()
</pre>



<h2 class="wp-block-heading">Results</h2>



<p>I was able to scrape data for 16,000 cities in about half a week. I think I averaged about 100 cities an hour. The larger cities had over a hundred pages but there were <strong>thousands upon thousands of cities with 5-10 pages</strong>.</p>



<p>What about the data? It’s fairly extensive and could be very useful.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="518" height="788" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png" alt="" class="wp-image-1652324" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png 518w, https://blog.finxter.com/wp-content/uploads/2023/10/image-154-197x300.png 197w" sizes="auto, (max-width: 518px) 100vw, 518px" /></figure>
</div>


<p>Pretty good information related to services offered, location, amenities, total ratings, etc. Looking at the rest of the columns:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="510" height="396" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png" alt="" class="wp-image-1652325" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png 510w, https://blog.finxter.com/wp-content/uploads/2023/10/image-155-300x233.png 300w" sizes="auto, (max-width: 510px) 100vw, 510px" /></figure>
</div>


<h2 class="wp-block-heading">Conclusion</h2>



<p>Uncovering the API proved invaluable. It eliminated the need to craft path selectors for individual data elements, significantly streamlining the process. Moreover, it spared me from devising a Scrapy workaround for the JavaScript-rendered page. Investing time in learning Scrapy was a sound decision, given its superior speed compared to other methods I explored.</p>



<p>Looking ahead, the logical progression is to integrate the data into platforms like Jupyter Notebook, Power BI, or Tableau. Furthermore, storing the data in a database seems apt, especially considering the apparent one-to-many relationships observed in each city, like categories and subcategories.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to become a master web scraper, feel free to check out our academy course with downloadable PDF certificate to showcase your skills to future employers or freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="800" height="341" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png" alt="" class="wp-image-1652329" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png 800w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-300x128.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-768x327.png 768w" sizes="auto, (max-width: 800px) 100vw, 800px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Academy</strong>: <a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/">Web Scraping with BeautifulSoup</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path</title>
		<link>https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/</link>
		
		<dc:creator><![CDATA[Shubham Sayon]]></dc:creator>
		<pubDate>Thu, 28 Sep 2023 19:56:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=22845</guid>

					<description><![CDATA[<p>Summary: Use urllib.parse.urljoin() to scrape the base URL and the relative path and join them to extract the complete/absolute URL. You can also concatenate the base URL and the absolute path to derive the absolute path; but make sure to take care of erroneous situations like extra forward-slash in this case. Quick Answer When web ... <a title="Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path" class="read-more" href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/" aria-label="Read more about Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/">Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-global-color-8-background-color has-background"><strong>Summary: </strong>Use <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin" target="_blank" rel="noreferrer noopener"><code data-enlighter-language="generic" class="EnlighterJSRAW">urllib.parse.urljoin()</code></a> to join the scraped base URL and the relative path into the complete/<strong>absolute </strong>URL. You can also concatenate the base URL and the relative path manually to derive the absolute URL, but make sure to handle erroneous situations like an extra forward slash in this case.</p>



<h2 class="wp-block-heading">Quick Answer</h2>



<p>When web scraping with BeautifulSoup in Python, you may encounter relative URLs (e.g., <code>/page2.html</code>) instead of absolute URLs (e.g., <code>http://example.com/page2.html</code>). To convert relative URLs to absolute URLs, you can use the <code>urljoin()</code> function from the <code>urllib.parse</code> module.</p>



<p>Below is an example of how to extract absolute URLs from the <code>a</code> tags on a webpage using <code>BeautifulSoup</code> and <code>urljoin</code>:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="816" height="757" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-130.png" alt="" class="wp-image-1651860" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-130.png 816w, https://blog.finxter.com/wp-content/uploads/2023/09/image-130-300x278.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-130-768x712.png 768w" sizes="auto, (max-width: 816px) 100vw, 816px" /></figure>
</div>


<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

# URL of the webpage you want to scrape
url = 'http://example.com'

# Send an HTTP request to the URL
response = requests.get(url)
response.raise_for_status()  # Raise an error for bad responses

# Parse the webpage content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the 'a' tags on the webpage
for a_tag in soup.find_all('a'):
    # Get the href attribute from the 'a' tag
    href = a_tag.get('href')

    # Use urljoin to convert the relative URL to an absolute URL
    absolute_url = urljoin(url, href)

    # Print the absolute URL
    print(absolute_url)</pre>



<p>In this example:</p>



<ul class="wp-block-list">
<li><code>url</code> is the URL of the webpage you want to scrape.</li>



<li><code>response</code> is the HTTP response obtained by sending an HTTP GET request to the URL.</li>



<li><code>soup</code> is a <code>BeautifulSoup</code> object that contains the parsed HTML content of the webpage.</li>



<li><code>soup.find_all('a')</code> finds all the <code>a</code> tags on the webpage.</li>



<li><code>a_tag.get('href')</code> gets the <code>href</code> attribute from an <code>a</code> tag, which is the relative URL.</li>



<li><code>urljoin(url, href)</code> converts the relative URL to an absolute URL by joining it with the base URL.</li>



<li><code>absolute_url</code> is the absolute URL, which is printed to the console.</li>
</ul>



<p>Now that you have a quick overview, let&#8217;s dive into the specific problem more deeply and discuss various methods to solve this easily and effectively. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Problem Formulation</h2>



<p><strong>Problem: </strong>How do you extract all the absolute URLs from an HTML page?</p>



<p><strong>Example: </strong>Consider the following webpage which has numerous links:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="355" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-129-1024x355.png" alt="" class="wp-image-1651858" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-129-1024x355.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129-300x104.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129-768x266.png 768w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129.png 1266w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-1024x319.png" alt="" class="wp-image-22847" style="object-fit:contain;width:784px;height:243px" width="784" height="243" srcset="https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-1024x319.png 1024w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-300x93.png 300w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-768x239.png 768w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-150x47.png 150w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links.png 1305w" sizes="auto, (max-width: 784px) 100vw, 784px" /><figcaption class="wp-element-caption"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Link</strong>: <a href="https://sayonshubham.github.io/">https://sayonshubham.github.io/</a></figcaption></figure>
</div>


<p>Now, when you try to <a href="https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python">scrape</a> the links as highlighted above, you find that only the relative links/paths are extracted instead of the entire absolute path. Let us have a look at the code given below, which demonstrates what happens when you try to extract the <code>'href'</code> elements normally.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get(web_url, headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        print(url['href'])</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">/
/about
/blog
/finxter
/</pre>



<p>The above output is not what you desired. You wanted to extract the absolute paths as shown below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>



<p>Without further delay, let us go ahead and try to extract the absolute paths instead of the relative paths. </p>



<h2 class="wp-block-heading">Method 1: Using <span class="has-inline-color has-luminous-vivid-orange-color">urllib.parse.urljoin()</span></h2>



<p>The easiest solution to our problem is to use the <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin" target="_blank" rel="noreferrer noopener"><code>urllib.parse.urljoin()</code></a> method.</p>



<p>According to the Python documentation, <code data-enlighter-language="generic" class="EnlighterJSRAW">urllib.parse.urljoin()</code> constructs a full/absolute URL by combining a &#8220;base URL&#8221; with another URL. The advantage of using <code>urljoin()</code> is that it properly resolves the relative path, whether the base URL is just the domain or the absolute URL of a specific webpage.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from urllib.parse import urljoin

URL_1 = 'http://www.example.com'
URL_2 = 'http://www.example.com/something/index.html'

print(urljoin(URL_1, '/demo'))
print(urljoin(URL_2, '/demo'))</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">http://www.example.com/demo
http://www.example.com/demo</pre>



<p>Now that we have an idea about <code data-enlighter-language="generic" class="EnlighterJSRAW">urljoin</code>, let us have a look at the following code, which resolves our problem and extracts the complete/absolute paths from the HTML page.</p>



<p><strong>Solution:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get(web_url, headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        print(urljoin(web_url, url.get('href')))</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>



<h2 class="wp-block-heading">Method 2: Concatenate The Base URL And Relative URL Manually</h2>



<p>Another workaround to our problem is to concatenate the base part of the URL and the relative URLs manually, just like two ordinary strings. The problem is that manual concatenation can introduce subtle errors such as duplicated slashes. Try to spot the extra forward slash character <code>/</code> below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">URL_1 = 'http://www.example.com/'
print(URL_1+'/demo')

# Output --> http://www.example.com//demo</pre>



<p>Therefore, to concatenate correctly, you have to adjust your code so that any extra character that could cause errors is removed. The following code concatenates the base URL and the relative paths without introducing any extra forward slash.</p>



<p><strong><em>Solution:</em></strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get(web_url, headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        # extract the href string
        x = url['href']
        # remove the extra forward-slash if present
        if x[0] == '/':       
            print(web_url + x[1:])
        else:
            print(web_url+x)</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/26a0.png" alt="⚠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong><span style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-red-color">Caution:</span></strong> This is not the recommended way of extracting the absolute path from a given HTML page. If you write an automated script that must resolve URLs, but at the time of writing you don&#8217;t know which websites it will visit, this method won&#8217;t serve your purpose; your go-to method should be <code data-enlighter-language="generic" class="EnlighterJSRAW">urljoin</code>. Nevertheless, the method deserves a mention because, in our case, it serves the purpose and helps us extract the absolute URLs.</p>
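<p>To see why <code>urljoin()</code> is the safer default, here is a small sketch (the example hrefs are made up) of three cases that naive string concatenation handles incorrectly:</p>

```python
from urllib.parse import urljoin

base = 'https://sayonshubham.github.io/'

# urljoin() handles cases that naive string concatenation gets wrong:
print(urljoin(base, '/about'))                 # leading slash: no '//' doubling
print(urljoin(base, 'https://example.com/x'))  # absolute href: left untouched
print(urljoin(base, '../up'))                  # relative navigation is resolved
```

Manual concatenation would produce `https://sayonshubham.github.io//about`, mangle the already-absolute link, and leave the `../` unresolved.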



<h2 class="wp-block-heading">Conclusion</h2>



<p>In this article, we learned how to extract absolute links from a given HTML page using BeautifulSoup. If you want to master Python&#8217;s BeautifulSoup library and dive deep into its concepts with examples and video lessons, have a look at the following link and follow the articles one by one; you will find every aspect of BeautifulSoup explained in great detail.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Web Scraping With Beautiful Soup" width="937" height="527" src="https://www.youtube.com/embed/videoseries?list=PLbo6ydLr984ZbU9VrB1ouj9CCJ80x4Xmo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank" rel="noreferrer noopener">Web Scraping With BeautifulSoup In Python</a></p>



<p>With that, we come to the end of this tutorial! Please <strong><a href="http://blog.finxter.com/subscribe" target="_blank" rel="noreferrer noopener">stay tuned</a></strong> and <strong><a href="https://www.youtube.com/channel/UCRlWL2q80BnI4sA5ISrz9uw" target="_blank" rel="noreferrer noopener">subscribe</a></strong> for more interesting content in the future.</p>



<p>The post <a href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/">Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>3 Pythonic Ways to Download a PDF from a URL</title>
		<link>https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Thu, 20 Jul 2023 19:14:33 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1514249</guid>

					<description><![CDATA[<p>If you&#8217;re short on time, here&#8217;s the code for copy and paste: 👇 Let&#8217;s dive into the whole article, keep reading to learn and improve your skills (and enjoy the beautiful spider 🕷️🕸️ images I hand-picked for you)! 👇 💡 Quick overview: I&#8217;ll show you the three most Pythonic ways to download a PDF from ... <a title="3 Pythonic Ways to Download a PDF from a URL" class="read-more" href="https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/" aria-label="Read more about 3 Pythonic Ways to Download a PDF from a URL">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/">3 Pythonic Ways to Download a PDF from a URL</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>If you&#8217;re short on time, here&#8217;s the code for copy and paste: <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests

url = 'https://bitcoin.org/bitcoin.pdf'
response = requests.get(url)
response.raise_for_status()  # stop early instead of saving an error page

with open('sample.pdf', 'wb') as f:
    f.write(response.content)</pre>



<p>Let&#8217;s dive into the whole article. Keep reading to learn and improve your skills (and enjoy the beautiful spider <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f577.png" alt="🕷" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f578.png" alt="🕸" class="wp-smiley" style="height: 1em; max-height: 1em;" /> images I hand-picked for you)! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Quick overview</strong>: I&#8217;ll show you the three most Pythonic ways to download a PDF from a URL in Python: </p>



<ul class="wp-block-list">
<li><strong>Method 1</strong>: Use the <code>requests</code> library, a third-party library that allows you to send HTTP requests using Python. </li>



<li><strong>Method 2</strong>: Use the <code>urllib</code> module, a built-in Python library for handling URLs. </li>



<li><strong>Method 3</strong>: Use the popular BeautifulSoup library for web scraping. </li>
</ul>



<p>But first things first&#8230;</p>



<h2 class="wp-block-heading">Understanding the Basics</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="924" height="611" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-255.png" alt="" class="wp-image-1514271" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-255.png 924w, https://blog.finxter.com/wp-content/uploads/2023/07/image-255-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-255-768x508.png 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /></figure>
</div>


<p>To download PDFs from a URL in Python, one must first understand the basics of web scraping. <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank" rel="noreferrer noopener">Web scraping</a> is the process of extracting data from websites. It involves parsing HTML and other web page content to extract the desired information.</p>



<p><strong>Step 1: </strong>The first step in web scraping is to send an HTTP request to the URL of the web page you want to access. Once you have sent the request, you will receive an HTTP response from the server. This response will contain the HTML content of the web page.</p>



<p><strong>Step 2: </strong>To extract the PDF file link from the HTML content, use Python libraries such as Requests and BeautifulSoup. Requests is used for making HTTP requests to a website, while BeautifulSoup is used for parsing the HTML content of a web page.</p>



<p><strong>Step 3: </strong>Once you have parsed the HTML content and located the PDF file link, you can use the Requests library to download the PDF file. The Requests library provides a simple way to download files from the web. You can use its <code>get()</code> method to download the PDF file from the URL.</p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: Some websites may have restrictions on downloading PDF files. In such cases, you may need to provide additional headers to the HTTP request to bypass these restrictions.</p>
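<p>As a sketch of how such headers can be attached (the URL and the <code>User-Agent</code> value below are placeholders), you can prepare a request without sending it and inspect what would go over the wire:</p>

```python
import requests

# Some servers reject requests that lack a browser-like User-Agent header.
# Passing a headers dict to requests attaches it to the outgoing request.
url = 'https://example.com/report.pdf'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Prepare the request without sending it, to inspect the outgoing headers:
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)

# In practice, you would send it directly:
# response = requests.get(url, headers=headers)
# response.raise_for_status()
```
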



<p>In summary, to download a PDF file from a URL in Python, you need to:</p>



<ol class="wp-block-list">
<li>Send an HTTP request to the URL of the web page you want to access</li>



<li>Parse the HTML content of the web page using BeautifulSoup</li>



<li>Locate the PDF file link in the HTML content</li>



<li>Use the Requests library to download the PDF file from the URL</li>
</ol>
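<p>The steps above can be sketched as follows. To keep the example self-contained, steps 2 and 3 run on a tiny hand-written HTML string; in a real script, that string would come from <code>requests.get(page_url).text</code>, and the URL shown is a placeholder:</p>

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = 'https://example.com/reports/'  # placeholder page URL
html = '''
<html><body>
  <a href="about.html">About</a>
  <a href="annual_report.pdf">Annual report (PDF)</a>
</body></html>
'''

# Step 2: parse the HTML; Step 3: locate links that point at PDF files.
soup = BeautifulSoup(html, 'html.parser')
pdf_links = [
    urljoin(page_url, a['href'])
    for a in soup.find_all('a', href=True)
    if a['href'].lower().endswith('.pdf')
]
print(pdf_links)  # ['https://example.com/reports/annual_report.pdf']

# Step 4 would then be:
# response = requests.get(pdf_links[0])
# with open('annual_report.pdf', 'wb') as f:
#     f.write(response.content)
```
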



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/is-web-scraping-legal/" data-type="URL" data-id="https://blog.finxter.com/is-web-scraping-legal/" target="_blank" rel="noreferrer noopener">Is Web Scraping Legal?</a></p>



<h2 class="wp-block-heading">Method 1: Using the Requests Library</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="924" height="614" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-256.png" alt="" class="wp-image-1514272" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-256.png 924w, https://blog.finxter.com/wp-content/uploads/2023/07/image-256-300x199.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-256-768x510.png 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /></figure>
</div>


<p>Python&#8217;s Requests library is a popular HTTP library that allows developers to send HTTP requests using Python. It is a simple and easy-to-use library that supports various HTTP methods, including GET, POST, PUT, DELETE, and more.</p>



<p>In this section, we will explore how to use the Requests library to download PDF files from a URL in Python.</p>



<h3 class="wp-block-heading">Setting Up Requests</h3>



<p>Before we can use the Requests library, we need to install it. We can install it using <code>pip</code>, which is a package manager for Python. To <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="post" data-id="35966" target="_blank">install <code>requests</code></a>, open a command prompt or terminal, and type the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install requests
</pre>



<p>Once installed, we can import the Requests library in our Python script using the following statement:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
</pre>



<h3 class="wp-block-heading">Downloading a PDF File</h3>



<p class="has-global-color-8-background-color has-background">To download a PDF file from a URL using the Requests library, we can use the <code><a href="https://blog.finxter.com/python-requests-get-the-ultimate-guide/" data-type="post" data-id="37837" target="_blank" rel="noreferrer noopener">get()</a></code> method, which sends an HTTP GET request to the specified URL and returns a response object. We can then use the <code>content</code> attribute of the response object to get the binary content of the PDF file.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="394" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-262-1024x394.png" alt="" class="wp-image-1514287" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-262-1024x394.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262-300x115.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262-768x296.png 768w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262-1536x591.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262.png 1702w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Here&#8217;s an example code snippet that demonstrates how to download a PDF file using <code>requests</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests

url = 'https://bitcoin.org/bitcoin.pdf'
response = requests.get(url)

with open('sample.pdf', 'wb') as f:
    f.write(response.content)</pre>



<p>In this code snippet, we first import the Requests library. We then define the URL of the PDF file we want to download and use the <code>get()</code> method to send an HTTP GET request to the URL. The response object contains the binary content of the PDF file, which we can write to a file using the <code><a href="https://blog.finxter.com/python-open-function/" data-type="post" data-id="24793" target="_blank" rel="noreferrer noopener">open()</a></code> function.</p>



<p>We use the <code>'wb'</code> mode to open the file in binary mode, which allows us to write the binary content of the PDF file to disk with the <code>write()</code> method.</p>



<p>That&#8217;s it! We have successfully downloaded a PDF file from a URL using the <code>requests</code> library in Python.</p>
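<p>For large PDFs, you may prefer to stream the response instead of loading the whole file into memory at once. Here is a sketch using the same Bitcoin whitepaper URL (the 8&nbsp;KiB chunk size is an arbitrary, commonly used choice):</p>

```python
import requests

url = 'https://bitcoin.org/bitcoin.pdf'

# stream=True defers downloading the body; iter_content() yields it in
# chunks, so only one chunk is held in memory at a time.
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open('bitcoin.pdf', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```
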



<h2 class="wp-block-heading">Method 2: Utilizing the Urllib Library</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="922" height="619" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-257.png" alt="" class="wp-image-1514273" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-257.png 922w, https://blog.finxter.com/wp-content/uploads/2023/07/image-257-300x201.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-257-768x516.png 768w" sizes="auto, (max-width: 922px) 100vw, 922px" /></figure>
</div>


<h3 class="wp-block-heading">Importing Urllib</h3>



<p>The <code>urllib</code> library is a built-in library in Python that allows developers to interact with URLs. Before using the <code>urllib</code> library, developers need to import it into their Python script. </p>



<p>To import the <code>urllib</code> library, developers can use the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import urllib.request
</pre>



<h3 class="wp-block-heading">Downloading a PDF with Urllib</h3>



<p class="has-global-color-8-background-color has-background">Once the <code>urllib</code> library is imported, you can use it to download PDFs from a URL. To download a PDF using <code>urllib</code>, use the <code>urlretrieve()</code> function, which takes two arguments: the URL of the PDF and the name of the file where the PDF will be saved. </p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import urllib.request

url = 'http://example.com/some_file.pdf'
filename = 'some_file.pdf'

urllib.request.urlretrieve(url, filename)
</pre>



<p>In this example, the <code>url</code> variable contains the URL of the PDF, and the <code>filename</code> variable contains the name of the file where the PDF will be saved. The <code>urlretrieve()</code> function downloads the PDF from the URL and saves it to the specified filename.</p>



<p>It&#8217;s important to note that the <code>urllib.request</code> module is specific to Python 3.x. In the long-unsupported Python 2.x, the same functionality was split across the <code>urllib</code> and <code>urllib2</code> libraries, and <code>urllib2</code> was commonly used to download files. </p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import urllib2

url = 'http://example.com/some_file.pdf'
filename = 'some_file.pdf'

response = urllib2.urlopen(url)
pdf = response.read()

with open(filename, 'wb') as f:
    f.write(pdf)
</pre>



<p>In this example, the <code>urllib2</code> library is used to download the PDF from the URL. The PDF is then saved to the specified filename using the <code>open()</code> function.</p>



<p>Overall, the <code>urllib</code> library is a useful tool for developers who need to download PDFs from URLs in their Python scripts. With the <code>urlretrieve()</code> function, developers can easily download PDFs and save them to a file.</p>



<h2 class="wp-block-heading">Method 3: Incorporating BeautifulSoup</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="927" height="710" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-258.png" alt="" class="wp-image-1514274" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-258.png 927w, https://blog.finxter.com/wp-content/uploads/2023/07/image-258-300x230.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-258-768x588.png 768w" sizes="auto, (max-width: 927px) 100vw, 927px" /></figure>
</div>


<h3 class="wp-block-heading">Integrating BeautifulSoup</h3>



<p>BeautifulSoup is a Python library that is widely used for web scraping purposes. It is a powerful tool for devs like you and me to extract information from HTML and XML documents. </p>



<p>When it comes to downloading PDFs from a website, BeautifulSoup can be used in conjunction with the <code>requests</code> library to extract links to PDF files from the HTML source code of a website.</p>



<p>To start using BeautifulSoup, import it into your Python environment and use the <code>BeautifulSoup()</code> constructor to create a BeautifulSoup object from the HTML source code of a website. Once you have a BeautifulSoup object, use its methods to extract information from the HTML source code.</p>



<h3 class="wp-block-heading">Extracting PDFs from HTML Source</h3>



<p>To extract PDF links from the HTML source code of a website, developers can use BeautifulSoup&#8217;s <code>find_all()</code> method to find all the <code>&lt;a></code> tags in the HTML source code. They can then loop through the <code>&lt;a></code> tags and check if the <code>href</code> attribute of each tag points to a PDF file.</p>



<p>If the <code>href</code> attribute of a tag points to a PDF file, use the <code>requests</code> library to download the PDF file. Use the <code>get()</code> method of the requests library to send an HTTP GET request to the URL of the PDF file. The response object returned by the <code>get()</code> method will contain the contents of the PDF file. Then use Python&#8217;s built-in <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-open-multiple-files-in-python/" data-type="post" data-id="403242" target="_blank">file handling</a> functions to save the contents of the PDF file to a local file.</p>
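<p>As a concrete sketch of the link-filtering step (the page URL and HTML below are hypothetical, used only to illustrate the technique):</p>

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML source of a page that links to some PDFs
html = """
<html><body>
  <a href="/files/report.pdf">Report</a>
  <a href="https://example.com/guide.pdf">Guide</a>
  <a href="/about.html">About</a>
</body></html>
"""

page_url = "https://example.com/downloads"
soup = BeautifulSoup(html, "html.parser")

# Keep every <a> tag whose href points at a PDF, resolving relative links
pdf_links = [
    urljoin(page_url, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".pdf")
]
print(pdf_links)
```

<p>Each URL collected this way can then be fetched with <code>requests.get()</code> and written to disk in binary mode, as in the earlier examples.</p>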



<h2 class="wp-block-heading">Handling Errors and Exceptions</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="915" height="512" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-259.png" alt="" class="wp-image-1514276" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-259.png 915w, https://blog.finxter.com/wp-content/uploads/2023/07/image-259-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-259-768x430.png 768w" sizes="auto, (max-width: 915px) 100vw, 915px" /></figure>
</div>


<h3 class="wp-block-heading">Anticipating Common Errors</h3>



<p>When downloading PDF files from URLs using Python, it is essential to anticipate common errors that may occur and prepare for them. </p>



<p>One common error is when the URL is invalid or the PDF file does not exist. </p>



<p>In such cases, the program may crash, and you won&#8217;t receive any feedback. </p>



<p>Another error may occur when you don&#8217;t have the necessary permissions to access the PDF file.</p>



<p>To anticipate such errors, one can use the<a rel="noreferrer noopener" href="https://blog.finxter.com/exploring-pythons-os-module/" data-type="post" data-id="19050" target="_blank"> <code>os</code> module</a> to check whether a file with the same name already exists locally before downloading. Additionally, one can check the response status code to confirm that the request succeeded: if the status code is not 200, the request failed and no PDF was downloaded.</p>
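<p>A minimal sketch of these checks, using a hypothetical <code>download_pdf()</code> helper of our own naming:</p>

```python
import os

import requests

def download_pdf(url, filename):
    """Download url to filename; return True only if a new file was written."""
    if os.path.exists(filename):
        return False  # avoid overwriting a file that was already downloaded
    response = requests.get(url)
    if response.status_code != 200:
        return False  # the request failed, so there is no PDF to save
    with open(filename, "wb") as f:
        f.write(response.content)
    return True
```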



<h3 class="wp-block-heading">Implementing Error Handling Functions</h3>



<p>When errors occur, handling them gracefully and providing feedback to the user is essential. One way to do this is by implementing error handling functions that catch the errors and provide feedback to the user.</p>



<p>One can use the <a href="https://blog.finxter.com/python-try-except-an-illustrated-guide/" data-type="post" data-id="367118" target="_blank" rel="noreferrer noopener"><code>try</code> and <code>except</code> statements</a> to catch errors and handle them gracefully. For example, when downloading PDF files, one can catch exceptions such as <code>requests.exceptions.RequestException</code> and <code>IOError</code> and provide feedback to the user.</p>
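<p>A hedged sketch of such a wrapper (the function name is our own) could look like this:</p>

```python
import requests

def safe_download(url, filename):
    """Download url to filename, reporting failures instead of crashing."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        with open(filename, "wb") as f:
            f.write(response.content)
    except requests.exceptions.RequestException as e:
        print(f"Download failed: {e}")
    except IOError as e:
        print(f"Could not write {filename}: {e}")
```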



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Python Try Except: An Illustrated Guide" width="937" height="527" src="https://www.youtube.com/embed/s-e0qL0FH9I?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Another way to handle errors is by checking HTTP status codes. For example, if the user does not have the necessary permissions to access the PDF file, the server will respond with a status code such as 403, which indicates that access to the file is forbidden; the program can report this code to the user.</p>



<h2 class="wp-block-heading">Organizing Downloaded PDFs</h2>



<p>After downloading PDF files using Python, organizing them properly for easy access and management is important. This section will cover how to create a directory to store downloaded PDFs and how to save the PDFs to that directory.</p>



<h3 class="wp-block-heading">Creating a Directory</h3>



<p>To create a directory to store downloaded PDFs, Python&#8217;s <code>os</code> module can be used. The <code>os</code> module provides a way to interact with the file system and create directories.</p>



<p>Here is an example code snippet that creates a directory called &#8220;PDFs&#8221; in the current working directory:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os

directory = "PDFs"
if not os.path.exists(directory):
    os.makedirs(directory)
</pre>



<p>This code checks if a directory named &#8220;PDFs&#8221; already exists in the current working directory. If it doesn&#8217;t exist, it creates the directory using the <code>os.makedirs()</code> function.</p>



<h3 class="wp-block-heading">Saving PDFs to a Directory</h3>



<p>Once a directory has been created to store downloaded PDFs, the next step is to save the PDFs to that directory.</p>



<p>Here is an example code snippet that downloads a sample PDF file and saves it to the &#8220;<code>PDFs</code>&#8221; directory:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import requests

url = "https://example.com/sample.pdf"
response = requests.get(url)

filename = "sample.pdf"
filepath = os.path.join("PDFs", filename)

with open(filepath, "wb") as f:
    f.write(response.content)
</pre>



<p>This code downloads a sample PDF file from the URL provided and saves it to a file named &#8220;<code>sample.pdf</code>&#8221; in the &#8220;<code>PDFs</code>&#8221; directory. The <code>os.path.join()</code> function is used to create the full path to the file by joining the directory name and filename together.</p>



<h2 class="wp-block-heading">Frequently Asked Questions</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="618" height="925" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-260.png" alt="" class="wp-image-1514277" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-260.png 618w, https://blog.finxter.com/wp-content/uploads/2023/07/image-260-200x300.png 200w" sizes="auto, (max-width: 618px) 100vw, 618px" /></figure>
</div>


<h3 class="wp-block-heading">How can I download a PDF file from a URL using Python?</h3>



<p>There are several ways to download a PDF file from a URL using Python. One of the most popular ways is to use the requests module. This module allows you to send HTTP requests using Python, which can be used to download files from a URL. You can also use the <code>urllib</code> module to download files from a URL.</p>



<h3 class="wp-block-heading">What is the best way to download a PDF file from a website using Python?</h3>



<p>The best way to download a PDF file from a website using Python depends on the specific website and the structure of the website. However, using the <code>requests</code> module is a popular method to download files from a website. You can also use the <code>urllib</code> module to download files from a website.</p>



<h3 class="wp-block-heading">How do I save a PDF file in Python after downloading it from a URL?</h3>



<p>After downloading a PDF file from a URL using Python, you can save it to a directory by using the <code>open()</code> function and the <code>write()</code> method. You will need to specify the file name and the directory where you want to save the file.</p>



<h3 class="wp-block-heading">What is the easiest way to download a PDF file using requests in Python?</h3>



<p>The easiest way to download a PDF file using requests in Python is to use the <code>get()</code> method of the requests module to fetch the URL of the file, then write the response content to a file in the directory where you want to save it.</p>



<h3 class="wp-block-heading">How can I scrape a PDF file from a website using BeautifulSoup in Python?</h3>



<p>You can scrape a PDF file from a website using BeautifulSoup in Python by first finding the URL of the PDF file on the website. Once you have the URL, you can use the <code>requests</code> module to download the file and then save it to a directory using the <code>open()</code> function and the <code>write()</code> method.</p>



<h3 class="wp-block-heading">What is the most efficient way to download a file from a URL and save it to a directory using Python?</h3>



<p>For most purposes, the <code>requests</code> module is the most efficient option. For large files, pass <code>stream=True</code> to <code>requests.get()</code> and write the response to disk in chunks with <code>iter_content()</code>, so the entire file never has to be held in memory. The <code>urllib</code> module remains a dependency-free alternative.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-261-1024x574.png" alt="" class="wp-image-1514278" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-261-1024x574.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/07/image-261-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-261-768x431.png 768w, https://blog.finxter.com/wp-content/uploads/2023/07/image-261.png 1239w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/5-easy-ways-to-download-an-image-from-a-url-in-python/" data-type="URL" data-id="https://blog.finxter.com/5-easy-ways-to-download-an-image-from-a-url-in-python/" target="_blank" rel="noreferrer noopener">5 Easy Ways to Download an Image from a URL in Python</a></p>
<p>The post <a href="https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/">3 Pythonic Ways to Download a PDF from a URL</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR</title>
		<link>https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/</link>
		
		<dc:creator><![CDATA[Emily Rosemary Collins]]></dc:creator>
		<pubDate>Tue, 13 Jun 2023 13:52:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1434923</guid>

					<description><![CDATA[<p>The Securities and Exchange Commission&#8217;s (SEC) Electronic Data Gathering, Analysis, and Retrieval system, known as EDGAR, serves as a rich source of information. This comprehensive database houses financial reports and statements that companies are legally required to disclose, such as a quarterly report filed by institutional investment managers. However, when attempting to extract data from ... <a title="Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR" class="read-more" href="https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/" aria-label="Read more about Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/">Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The <a rel="noreferrer noopener" href="https://www.sec.gov/" data-type="URL" data-id="https://www.sec.gov/" target="_blank">Securities and Exchange Commission&#8217;s (SEC)</a> Electronic Data Gathering, Analysis, and Retrieval system, known as EDGAR, serves as a rich source of information. This comprehensive database houses financial reports and statements that companies are legally required to disclose, such as a quarterly report filed by institutional investment managers.</p>



<p>However, when attempting to extract data from EDGAR via web scraping, you might encounter a stumbling block: an HTTPError that reads,<strong> &#8220;HTTP Error 403: Forbidden.&#8221;</strong> </p>



<p>This is a common issue faced by many data enthusiasts and researchers trying to access data programmatically from the EDGAR database.</p>



<h2 class="wp-block-heading">Understanding the Error</h2>



<p>HTTP Error 403, often termed a <strong>&#8216;Forbidden&#8217;</strong> error, is an HTTP status code signifying that the server understood the request but refuses to authorize it. This doesn&#8217;t necessarily mean the requester did something wrong; rather, it implies that access to the requested resource is forbidden for some reason.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="530" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-119-1024x530.png" alt="" class="wp-image-1434996" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-119-1024x530.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119-300x155.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119-768x397.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119-1536x795.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119.png 1741w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em><strong>Screenshot</strong>: Accessing the page may work in the browser but not in your Python code.</em></figcaption></figure>
</div>


<p>When you encounter an HTTP 403 error while accessing the EDGAR 13F filings, it means the EDGAR server has denied your request to download the data. This is typically because the request appears to be from a script or a bot rather than a human using a web browser.</p>



<h2 class="wp-block-heading">Bypassing the Error</h2>



<p class="has-global-color-8-background-color has-background">One common workaround for the 403 error is to <strong>modify the HTTP request&#8217;s user-agent header</strong> to imitate a web browser. Web servers use the user-agent header to identify the client making the request and can sometimes restrict access based on this information.</p>



<p>Here is a Python example using the <code>requests</code> library:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="4" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests

url = 'https://www.sec.gov/Archives/edgar/data/.../' # Put your target URL here
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
</pre>



<p>In this example, we set the User-Agent to mimic a common web browser, effectively tricking the server into treating the script as a regular user.</p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f469-200d-1f4bb.png" alt="👩‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/python-requests-library/" data-type="URL" data-id="https://blog.finxter.com/python-requests-library/" target="_blank" rel="noreferrer noopener">Python Requests Library – Your First HTTP Request in Python</a></p>



<h2 class="wp-block-heading">Caution and Consideration</h2>



<p>While this technique may help bypass the 403 error, it&#8217;s crucial to emphasize that it should be used responsibly. The SEC might have legitimate reasons for preventing certain types of access to their system. Overuse or misuse of this workaround might lead to IP blocking or other consequences.</p>



<p>Moreover, remember that it&#8217;s important to respect the terms of service of the website you&#8217;re accessing and adhere to any rate limits or access restrictions. Before you use scraping techniques, it&#8217;s advisable to review the SEC&#8217;s EDGAR access rules and usage guidelines.</p>
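<p>In practice, polite access can be as simple as declaring who you are and pacing your requests. The sketch below is illustrative only; the name and email are placeholders you should replace with your own contact details:</p>

```python
import time

import requests

# Automated clients are expected to identify themselves; the name and
# email here are placeholders -- substitute your own contact details.
headers = {"User-Agent": "Jane Doe jane.doe@example.com"}

filing_urls = []  # fill with the EDGAR filing URLs you need

for url in filing_urls:
    response = requests.get(url, headers=headers)
    # ... process response.text here ...
    time.sleep(0.2)  # pause between requests to keep the request rate modest
```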



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f469-200d-1f4bb.png" alt="👩‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/is-web-scraping-legal/" data-type="post" data-id="383048" target="_blank" rel="noreferrer noopener">Is Web Scraping Legal?</a></p>



<p></p>
<p>The post <a href="https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/">Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python Web Scraping: From URL to CSV in No Time</title>
		<link>https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Sun, 23 Apr 2023 18:35:54 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[Data Conversion]]></category>
		<category><![CDATA[File Handling]]></category>
		<category><![CDATA[Pandas Library]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1313474</guid>

					<description><![CDATA[<p>Setting up the Environment Before diving into web scraping with Python, set up your environment by installing the necessary libraries. First, install the following libraries: requests, BeautifulSoup, and pandas. These packages play a crucial role in web scraping, each serving different purposes.✨ To install these libraries, click on the previously provided links for a full ... <a title="Python Web Scraping: From URL to CSV in No Time" class="read-more" href="https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/" aria-label="Read more about Python Web Scraping: From URL to CSV in No Time">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/">Python Web Scraping: From URL to CSV in No Time</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Setting up the Environment</h2>



<p>Before diving into web scraping with Python, set up your environment by installing the necessary libraries.</p>



<p>First, install the following libraries: <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="post" data-id="35966" target="_blank">requests</a></code>, <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-beautifulsoup4-in-python/" data-type="post" data-id="457056" target="_blank">BeautifulSoup</a></code>, and <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-pandas-in-python/" data-type="post" data-id="35926" target="_blank">pandas</a></code>. These packages play a crucial role in web scraping, each serving different purposes.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>To install these libraries, click on the previously provided links for a full guide (including troubleshooting) or simply run the following commands:</p>



<pre class="wp-block-preformatted"><code>pip install requests
pip install beautifulsoup4
pip install pandas</code>
</pre>



<p>The <code>requests</code> library will be used to make HTTP requests to websites and download the HTML content. It simplifies the process of fetching web content in Python.</p>



<p><code>BeautifulSoup</code> is a fantastic library that helps extract data from the HTML content fetched from websites. It makes navigating, searching, and modifying HTML easy, making web scraping straightforward and convenient.</p>



<p><code>Pandas</code> will be helpful in data manipulation and organizing the scraped data into a CSV file. It provides powerful tools for working with structured data, making it popular among data scientists and web scraping enthusiasts. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f43c.png" alt="🐼" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
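<p>For example, once scraped records are collected in memory, pandas can write them straight to a CSV file (the rows below are made-up sample data):</p>

```python
import pandas as pd

# Made-up sample records standing in for scraped data
rows = [
    {"quote": "To be or not to be", "author": "Shakespeare"},
    {"quote": "I think, therefore I am", "author": "Descartes"},
]

df = pd.DataFrame(rows)
df.to_csv("quotes.csv", index=False)  # one CSV row per scraped record
print(df.shape)
```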



<h2 class="wp-block-heading">Fetching and Parsing URL</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="495" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-230.png" alt="" class="wp-image-1313580" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-230.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-230-300x200.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>Next, you&#8217;ll learn how to fetch and parse URLs using Python to <strong>scrape data and save it as a CSV file</strong>. We will cover sending HTTP requests, handling errors, and utilizing libraries to make the process efficient and smooth. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Sending HTTP Requests</h3>



<p>When fetching content from a URL, Python offers a powerful library known as the <code>requests</code> library. It allows users to send HTTP requests, such as GET or POST, to a specific URL, obtain a response, and parse it for information. </p>



<p>We will use the <code>requests</code> library to help us fetch data from our desired URL. </p>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
response = requests.get('https://example.com/data.csv')</pre>



<p>The variable <code>response</code> will store the server&#8217;s response, including the data we want to scrape. From here, we can access the content using <code>response.content</code>, which will return the raw data in <a href="https://blog.finxter.com/python-bytes-vs-bytearray/" data-type="post" data-id="870390" target="_blank" rel="noreferrer noopener">bytes</a> format. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f310.png" alt="🌐" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Handling HTTP Errors</h3>



<p>Handling HTTP errors while fetching data from URLs ensures a smooth experience and prevents unexpected issues. The <code>requests</code> library makes error handling easy by providing methods to check whether the request was successful. </p>



<p>Here&#8217;s a simple example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
response = requests.get('https://example.com/data.csv')
response.raise_for_status()</pre>



<p>The <code>raise_for_status()</code> method will raise an exception if there&#8217;s an HTTP error, such as a 404 Not Found or 500 Internal Server Error. This helps us ensure that our script doesn&#8217;t continue to process erroneous data, allowing us to gracefully handle any issues that may arise. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>With these tools, you are now better equipped to fetch and parse URLs using Python. This will enable you to <strong>effectively scrape data and save it as a CSV</strong> file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Extracting Data from HTML</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="499" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-231.png" alt="" class="wp-image-1313581" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-231.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-231-300x201.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>In this section, we&#8217;ll discuss extracting data from HTML using Python. The focus will be on utilizing the BeautifulSoup library and locating elements by their tags and attributes. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Using BeautifulSoup</h3>



<p>BeautifulSoup is a popular Python library that simplifies web scraping tasks by making it easy to parse and navigate through HTML. To get started, import the library and request the page content you want to scrape, then create a BeautifulSoup object to parse the data:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests

url = "example_website"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
</pre>



<p>Now you have a BeautifulSoup object and can start extracting data from the HTML. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Locating Elements by Tags and Attributes</h3>



<p>BeautifulSoup provides various methods to locate elements by their tags and attributes. Some common methods include <code>find()</code>, <code>find_all()</code>, <code>select()</code>, and <code>select_one()</code>. </p>



<p>Let&#8217;s see these methods in action:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Find the first &lt;span> tag
span_tag = soup.find("span")

# Find all &lt;span> tags
all_span_tags = soup.find_all("span")

# Locate elements using CSS selectors
title = soup.select_one("title")

# Find all &lt;a> tags with the "href" attribute
links = soup.find_all("a", {"href": True})
</pre>



<p>These methods allow you to easily navigate and extract data from an HTML structure. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d0.png" alt="🧐" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>Once you have located the HTML elements containing the needed data, you can extract the text and attributes. </p>



<p>Here&#8217;s how:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Extract text from a tag
text = span_tag.text

# Extract an attribute value
url = links[0]["href"]
</pre>



<p>Finally, to save the extracted data into a CSV file, you can use Python&#8217;s built-in <code>csv</code> module. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f603.png" alt="😃" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import csv

# Writing extracted data to a CSV file
with open("output.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Index", "Title"])
    for index, link in enumerate(links, start=1):
        writer.writerow([index, link.text])
</pre>



<p>Following these steps, you can successfully extract data from HTML using Python and BeautifulSoup, and save it as a CSV file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f389.png" alt="🎉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/basketball-statistics-page-scraping-using-python-and-beautifulsoup/" data-type="post" data-id="1081082" target="_blank" rel="noreferrer noopener">Basketball Statistics – Page Scraping Using Python and BeautifulSoup</a></p>



<h2 class="wp-block-heading">Organizing Data</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="531" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-232.png" alt="" class="wp-image-1313582" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-232.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-232-300x214.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>This section explains how to <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-create-a-dictionary-from-two-lists/" data-type="post" data-id="316802" target="_blank">create a dictionary</a> to store the scraped data and how to write the organized data into a CSV file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Creating a Dictionary</h3>



<p>Begin by defining an empty dictionary that will store the extracted data elements. </p>



<p>In this case, the focus is on quotes, authors, and any associated tags. Each type of extracted element should have its own key, and the value should be a list that contains the individual instances of that element. </p>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">
data = {
    "quotes": [],
    "authors": [],
    "tags": []
}
</pre>



<p>As you scrape the data, append each item to its respective <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>. This approach makes the information easy to index and retrieve when needed. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4da.png" alt="📚" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
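<p>As a minimal sketch of this appending step, assuming a quotes page with hypothetical <code>quote</code>, <code>text</code>, <code>author</code>, and <code>tag</code> class names (the real class names depend on the site you scrape):</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of the page being scraped; the class names
# are illustrative, not taken from a real site
html = """
<div class="quote">
    <span class="text">To be or not to be.</span>
    <small class="author">William Shakespeare</small>
    <a class="tag">life</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

data = {"quotes": [], "authors": [], "tags": []}

# Append each scraped item to its respective list
for quote in soup.find_all("div", {"class": "quote"}):
    data["quotes"].append(quote.find("span", {"class": "text"}).text)
    data["authors"].append(quote.find("small", {"class": "author"}).text)
    data["tags"].append([a.text for a in quote.find_all("a", {"class": "tag"})])
```

<p>Because every list grows in lockstep, the first quote, first author, and first tag list all share index <code>0</code>, which keeps the data aligned for the dataframe conversion below.</p>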



<h3 class="wp-block-heading">Working with DataFrames and Pandas</h3>



<p>Once the data is stored in a dictionary, it&#8217;s time to <a href="https://blog.finxter.com/dictionary-of-lists-to-dataframe-python-conversion/" data-type="post" data-id="1296622" target="_blank" rel="noreferrer noopener">convert it into a dataframe</a>. Using the <a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">Pandas</a> library, it&#8217;s easy to transform the dictionary into a dataframe where the keys become the column names and the respective lists become the column values. </p>



<p>Simply use the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df = pd.DataFrame(data)</pre>



<h3 class="wp-block-heading">Exporting Data to a CSV File</h3>



<p>With the dataframe prepared, it&#8217;s time to write it to a CSV file. Thankfully, Pandas comes to the rescue once again. Using the dataframe&#8217;s built-in <code><a href="https://blog.finxter.com/convert-html-table-to-csv-in-python/" data-type="post" data-id="590862" target="_blank" rel="noreferrer noopener">.to_csv()</a></code> method, it&#8217;s possible to create a CSV file from the dataframe, like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">
df.to_csv('scraped_data.csv', index=False)
</pre>



<p>This command will generate a CSV file called <code>'scraped_data.csv'</code> containing the organized data with columns for quotes, authors, and tags. The <code>index=False</code> parameter ensures that the dataframe&#8217;s index isn&#8217;t added as an additional column. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4dd.png" alt="📝" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/read-a-csv-file-to-a-pandas-dataframe/" data-type="post" data-id="440655" target="_blank" rel="noreferrer noopener">17 Ways to Read a CSV File to a Pandas DataFrame</a></p>



<p>And there you have it—a neat, organized CSV file containing your scraped data!</p>



<h2 class="wp-block-heading">Handling Pagination</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="575" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-233.png" alt="" class="wp-image-1313583" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-233.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-233-300x232.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>This section will discuss how to handle pagination while scraping data from multiple URLs using Python to save the extracted content in a CSV format. It is essential to manage pagination effectively because most websites display their content across several pages.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Looping Through Web Pages</h3>



<p>Looping through web pages requires the developer to identify a pattern in the URLs, which can assist in iterating over them seamlessly. Typically, this pattern would include the page number as a variable, making it easy to adjust during the scraping process.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f501.png" alt="🔁" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>Once the pattern is identified, you can use a for loop to iterate over a range of page numbers. For each iteration, update the URL with the page number and then proceed with the scraping process. This method allows you to extract data from multiple pages systematically.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f5a5.png" alt="🖥" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>For instance, let&#8217;s consider that the base URL for every page is <code><em>"https://www.example.com/listing?page="</em></code>, where the page number is appended to the end. </p>



<p>Here is a Python example that demonstrates handling pagination when working with such URLs:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from bs4 import BeautifulSoup
import csv

base_url = "https://www.example.com/listing?page="

with open("scraped_data.csv", "w", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Data_Title", "Data_Content"])  # Header row

    for page_number in range(1, 6):  # Loop through page numbers 1 to 5
        url = base_url + str(page_number)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        
        # TODO: Add scraping logic here and write the content to the CSV file.

</pre>



<p>In this example, the script iterates through the first five pages of the website and writes the scraped content to a CSV file. Note that you will need to implement the actual scraping logic (e.g., extracting the desired content using Beautiful Soup) based on the website&#8217;s structure.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f310.png" alt="🌐" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
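<p>As a rough sketch, the scraping step inside the loop might look like the following, assuming each listing is a <code>&lt;div class="listing"></code> with hypothetical <code>title</code> and <code>content</code> children. The example runs against an inline HTML fragment and an in-memory buffer instead of a live page and a real file, so the class names and markup are illustrative only:</p>

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page structure; real tag and class names depend
# on the target website
page_html = """
<div class="listing">
    <h2 class="title">Item A</h2><p class="content">First entry</p>
</div>
<div class="listing">
    <h2 class="title">Item B</h2><p class="content">Second entry</p>
</div>
"""
soup = BeautifulSoup(page_html, "html.parser")

buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(["Data_Title", "Data_Content"])  # Header row

# The extraction step that would replace the TODO inside the page loop
for listing in soup.find_all("div", {"class": "listing"}):
    title = listing.find("h2", {"class": "title"}).text
    content = listing.find("p", {"class": "content"}).text
    csv_writer.writerow([title, content])

rows = buffer.getvalue().splitlines()
```

<p>In the real script, the same <code>for listing in ...</code> block would simply sit inside the page-number loop, writing to the already-open CSV file.</p>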



<p>Handling pagination with Python allows you to collect more comprehensive data sets<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4be.png" alt="💾" class="wp-smiley" style="height: 1em; max-height: 1em;" />, improving the overall success of your web scraping efforts. Make sure to respect the website&#8217;s <code>robots.txt</code> rules and rate limits to ensure responsible data collection.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f916.png" alt="🤖" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Exporting Data to CSV</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="546" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-234.png" alt="" class="wp-image-1313584" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-234.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-234-300x220.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>You can export web scraping data to a CSV file in Python using the Python CSV module and the Pandas <code>to_csv</code> function. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f603.png" alt="😃" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Both approaches are widely used and efficiently handle large amounts of data.</p>



<h3 class="wp-block-heading">Python CSV Module</h3>



<p>The Python CSV module is a built-in library that offers functionality for reading from and writing to CSV files. It is simple and easy to use<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f44d.png" alt="👍" class="wp-smiley" style="height: 1em; max-height: 1em;" />. To begin, import the <code>csv</code> module.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import csv
</pre>



<p>To write the scraped data to a CSV file, <a href="https://blog.finxter.com/python-open-function/" data-type="post" data-id="24793" target="_blank" rel="noreferrer noopener">open</a> the file in write mode (<code>'w'</code>) with a specified file name, create a <a href="https://blog.finxter.com/write-python-dict-to-csv-columns-keys-first-values-second-column/" data-type="post" data-id="570680" target="_blank" rel="noreferrer noopener">CSV writer</a> object, and write the data using the <code>writerow()</code> or <code>writerows()</code> methods as required.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">scraped_data = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]  # placeholder rows from your scraper

with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["header1", "header2", "header3"])
    writer.writerows(scraped_data)
</pre>



<p>In this example, the header row is written first, followed by the rows of data obtained through web scraping. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Using Pandas to_csv()</h3>



<p>Another alternative is the powerful library Pandas, often used in data manipulation and analysis. To use it, start by importing the Pandas library.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
</pre>



<p>Pandas offers the <code>to_csv()</code> method, which can be applied to a DataFrame. If you have web-scraped data and stored it in a DataFrame, you can easily export it to a CSV file with the <code>to_csv()</code> method, as shown below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">dataframe.to_csv('data.csv', index=False)
</pre>



<p>In this example, the index parameter is set to <code>False</code> to exclude the DataFrame index from the CSV file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4ca.png" alt="📊" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<p>The Pandas library also provides options for handling missing values, date formatting, and customizing separators and delimiters, making it a versatile choice for data export.</p>
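<p>For example, here is a quick sketch of a few of these <code>to_csv()</code> options applied to a small, made-up dataframe:</p>

```python
import pandas as pd

# Made-up data: one missing title and a datetime column
df = pd.DataFrame({
    "title": ["First", None],
    "scraped_at": pd.to_datetime(["2023-04-01", "2023-04-02"]),
})

# Customize the export: semicolon delimiter, a placeholder for
# missing values, and a fixed date format
csv_text = df.to_csv(index=False, sep=";", na_rep="N/A", date_format="%Y-%m-%d")
```

<p>Calling <code>to_csv()</code> without a file path, as above, returns the CSV content as a string, which is handy for inspecting the output before writing it to disk.</p>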



<h2 class="wp-block-heading">10 Minutes to Pandas in 5 Minutes </h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="495" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-235.png" alt="" class="wp-image-1313585" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-235.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-235-300x200.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>If you&#8217;re just getting started with Pandas, I&#8217;d recommend you check out our free blog guide (it&#8217;s only 5 minutes!): <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f43c.png" alt="🐼" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">5 Minutes to Pandas &#8212; A Simple Helpful Guide to the Most Important Pandas Concepts (+ Cheat Sheet)</a></p>
<p>The post <a href="https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/">Python Web Scraping: From URL to CSV in No Time</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?</title>
		<link>https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 21 Mar 2023 11:03:05 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1230960</guid>

					<description><![CDATA[<p>To access the first, second, or N-th child div element in BeautifulSoup, use the .contents or .find_all() methods on a parent div element. The .contents method returns a list of children, including tags and strings, while .find_all() returns a list of matching tags only. Simply select the desired index to obtain the child div element ... <a title="How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?" class="read-more" href="https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/" aria-label="Read more about How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/">How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-global-color-8-background-color has-background">To access the first, second, or N-th child div element in BeautifulSoup, use the <code>.contents</code> or <code>.find_all()</code> methods on a parent div element. The <code>.contents</code> method returns a list of children, including tags and strings, while <code>.find_all()</code> returns a list of matching tags only. Simply select the desired index to obtain the child div element you need.</p>



<p>In Beautiful Soup, you can navigate to the first, second, or third <code>div</code> within a parent <code>div</code> using the <code>.contents</code> or <code>.find_all()</code> methods. </p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="16-19,26-27" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup

html = """
&lt;div id="parent-div">
    &lt;div class="child-div">First child div&lt;/div>
    &lt;div class="child-div">Second child div&lt;/div>
    &lt;div class="child-div">Third child div&lt;/div>
&lt;/div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the parent div
parent_div = soup.find('div', {'id': 'parent-div'})

# Method 1: Using .contents
first_child_div = parent_div.contents[1]
second_child_div = parent_div.contents[3]
third_child_div = parent_div.contents[5]

print("Using .contents:")
print("First child div:", first_child_div.text)
print("Second child div:", second_child_div.text)
print("Third child div:", third_child_div.text)

# Method 2: Using .find_all()
all_child_divs = parent_div.find_all('div', {'class': 'child-div'})

print("\nUsing .find_all():")
print("First child div:", all_child_divs[0].text)
print("Second child div:", all_child_divs[1].text)
print("Third child div:", all_child_divs[2].text)
</pre>



<p>The output of this script is:</p>



<pre class="wp-block-preformatted"><code>Using .contents:
First child div: First child div
Second child div: Second child div
Third child div: Third child div

Using .find_all():
First child div: First child div
Second child div: Second child div
Third child div: Third child div</code>
</pre>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>:<br><br>The <code>.contents</code> solution returns a list of the parent element&#8217;s children, including tags and strings. Note that the indexing numbers are shifted using this solution because the whitespace between tags counts as string children, i.e., the first element is indexed using <code>.contents[1]</code>, the second with <code>.contents[3]</code>, and the <code>n</code>-th with <code>.contents[2*n-1]</code>.<br><br>The <code>.find_all()</code> solution returns a list of matching tags only. </p>



<p>You can use either method to navigate to the first, second, or third <code>div</code> within a parent <code>div</code>.</p>
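<p>If the shifted <code>.contents</code> indices feel error-prone, one option is to filter out the string children first, so the <code>n</code>-th child <code>div</code> sits at the natural index <code>n-1</code>. A minimal sketch, reusing the same parent/child HTML as above:</p>

```python
from bs4 import BeautifulSoup

html = """
<div id="parent-div">
    <div class="child-div">First child div</div>
    <div class="child-div">Second child div</div>
    <div class="child-div">Third child div</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
parent_div = soup.find("div", {"id": "parent-div"})

# Keep only child tags named "div", skipping the whitespace strings
# (NavigableString children have no tag name), so the n-th child
# div is simply child_divs[n-1]
child_divs = [c for c in parent_div.contents if getattr(c, "name", None) == "div"]
```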



<h2 class="wp-block-heading">Keep Learning</h2>



<p>If you want to learn BeautifulSoup from scratch, I&#8217;d recommend you check out our academy course:</p>



<figure class="wp-block-image size-full"><a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="994" height="876" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-233.png" alt="" class="wp-image-1230972" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-233.png 994w, https://blog.finxter.com/wp-content/uploads/2023/03/image-233-300x264.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-233-768x677.png 768w" sizes="auto, (max-width: 994px) 100vw, 994px" /></a></figure>
<p>The post <a href="https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/">How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup</title>
		<link>https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/</link>
		
		<dc:creator><![CDATA[Stephen Schwaner]]></dc:creator>
		<pubDate>Thu, 09 Mar 2023 13:14:48 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1194257</guid>

					<description><![CDATA[<p>Project Motivation My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch. Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that ... <a title="I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup" class="read-more" href="https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/" aria-label="Read more about I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/">I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="I Built a Kids’ Movie Ratings Database Using Beautiful Soup" width="937" height="527" src="https://www.youtube.com/embed/fL5YrE2k8uo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Project Motivation</h2>



<p>My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch. </p>



<p>Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that is easily sortable/filterable by scraping information from relevant websites. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="976" height="637" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-113.png" alt="" class="wp-image-1194318" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-113.png 976w, https://blog.finxter.com/wp-content/uploads/2023/03/image-113-300x196.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-113-768x501.png 768w" sizes="auto, (max-width: 976px) 100vw, 976px" /></figure>
</div>


<p>There are a few websites that we use to determine whether a movie is age-appropriate, but one of our favorites is <a href="https://kids-in-mind.com/" target="_blank" rel="noreferrer noopener">Kids-In-Mind</a>, so I decided to start there. Kids-In-Mind provides a ranking from 0 (none) to 10 (extreme) for a movie’s sex, violence, and foul language content. I set out to pull all of these ratings and condense them into a single Excel sheet that I could sort and filter however I like.</p>



<h2 class="wp-block-heading">What You Will Learn</h2>



<p>This article is written for someone familiar with Python, but who is a beginner at web scraping. This <a href="https://web.stanford.edu/group/csp/cs21/htmlcheatsheet.pdf" target="_blank" rel="noreferrer noopener">HTML cheat sheet</a> may be a helpful resource for quickly looking up different HTML tags.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="981" height="649" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-114.png" alt="" class="wp-image-1194319" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-114.png 981w, https://blog.finxter.com/wp-content/uploads/2023/03/image-114-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-114-768x508.png 768w" sizes="auto, (max-width: 981px) 100vw, 981px" /></figure>
</div>


<p>In this article, you will learn how I:</p>



<ul class="wp-block-list">
<li>Came up with a plan for scraping data from Kids-In-Mind</li>



<li>Examined the HTML for the relevant web pages</li>



<li>Used <strong>BeautifulSoup</strong> to parse the HTML for movie rating information</li>



<li>Handled variations in how pages were organized</li>



<li>Used pandas to write the resulting data to a CSV file</li>
</ul>



<p>In the rest of the article, I will abbreviate <strong>BeautifulSoup</strong> as <strong>bs4.</strong></p>



<p>You can download the full script here <a rel="noreferrer noopener" href="https://github.com/finxter/WebScrapeKidsMovies" data-type="URL" data-id="https://github.com/finxter/WebScrapeKidsMovies" target="_blank">https://github.com/finxter/WebScrapeKidsMovies</a>. I also attach the full script to the end of this page, so keep reading! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<h2 class="wp-block-heading">Planning the Scraping Approach</h2>



<p>First things first: how should I get started? When I visited the Kids-In-Mind home page, I noticed that they have a link to an “A-Z Index.” Jackpot! I realized I could visit each “letter” page and either follow links or pull information to get the data I needed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-105-1024x449.png" alt="" class="wp-image-1194261" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-105-1024x449.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105-300x132.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105-768x337.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105-1536x674.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>I was pleasantly surprised again when I visited the “A” page. The title, MPAA rating, year, and content ratings were all contained right there on the page! I decided to pull the HTML from each “letter” page and then parse that HTML to scrape information for each movie. </p>



<p>Clicking on the links to the “A” and “B” pages took me to the following URLs:</p>



<ul class="wp-block-list">
<li><a href="https://kids-in-mind.com/a.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/a.htm</a></li>



<li><a href="https://kids-in-mind.com/b.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/b.htm</a></li>
</ul>



<p>As you can see, simply exchanging the “a” for the “b” allowed me to navigate to each “letter” page on the site. This is how I decided to iterate through pages to pull information for all the movies on the site.</p>
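<p>Generating all of these &#8220;letter&#8221; URLs in Python is a one-liner:</p>

```python
import string

base_url = "https://kids-in-mind.com/"

# One URL per "letter" page, a.htm through z.htm
letter_urls = [f"{base_url}{letter}.htm" for letter in string.ascii_lowercase]
```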



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="528" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-106-1024x528.png" alt="" class="wp-image-1194263" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-106-1024x528.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106-300x155.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106-768x396.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106-1536x792.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To proceed, I still needed to figure out how each page was structured. I right-clicked on the first movie (<em>Abandon</em>) and selected the “Inspect” option (I’m using Google Chrome). </p>



<p>You can see that:</p>



<ul class="wp-block-list">
<li>The list of movies is contained within a <code>&lt;div></code> tag with an attribute <code>class = "et_pb_text_inner"</code> <strong>(1)</strong>,</li>



<li>The link and movie titles are each contained within an <code>&lt;a></code> tag <strong>(2)</strong>, and</li>



<li>the year and ratings are contained within the text trailing each <code>&lt;a></code> tag <strong>(3)</strong>.</li>
</ul>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: <em>Since I’m new to HTML, I initially thought the text with rating information was associated with each <code>&lt;a></code> tag. Upon closer inspection using </em><strong><em>BeautifulSoup</em></strong><em>, I found out that the text was actually associated with the <code>&lt;div></code> tag. You’ll see that in the code, further down.</em></p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="513" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-107-1024x513.png" alt="" class="wp-image-1194267" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-107-1024x513.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107-300x150.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107-768x385.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107-1536x770.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>In addition to the number of ratings for each content category, I also wanted to pull more detailed information about the sex content. </p>



<p>Since my kids are so young, sometimes even movies with low sex ratings can be inappropriate for them. For example, the movie might be aimed at a 10-year-old even though it is rated G with a sex rating of 1.</p>



<p>To get this content, I needed to follow each movie link to that movie’s page. I clicked on the “<em>The Adventures of Rocky and Bullwinkle</em>” link and used the “Inspect” tool to check out the HTML defining the movie’s “Sex/Nudity” section. </p>



<p>You can see:</p>



<ul class="wp-block-list">
<li>There is a <code>&lt;span></code> tag <strong>(2)</strong> nested inside a <code>&lt;p></code> tag <strong>(1)</strong>,</li>



<li>the <code>&lt;span></code> tag contains the paragraph heading, “Sex/Nudity” <strong>(3),</strong></li>



<li>and the text <strong>(4)</strong> trails the <code>&lt;span></code> tag.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="510" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-108-1024x510.png" alt="" class="wp-image-1194268" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-108-1024x510.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108-300x149.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108-768x383.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108-1536x765.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Now that I had visited a few relevant pages from the site and inspected the underlying HTML, I was able to define a general approach:</p>



<p>Scrape movie titles and ratings:</p>



<ol class="wp-block-list">
<li>Loop through each “letter” page and pull the HTML</li>



<li>Use <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank">BeautifulSoup</a> to find all <code>&lt;div></code> tags with <code>class = "et_pb_text_inner"</code></li>



<li>Determine which <code>&lt;div></code> tag contains the list of movies</li>



<li>Get the text from the <code>&lt;div></code> tag and parse it for movie names and information</li>



<li>Loop through each nested <code>&lt;a></code> tag and get the URL leading to each movie page (the value of the <code>href</code> attribute)</li>
</ol>



<p>Scrape sexual content description:</p>



<ol class="wp-block-list">
<li>Follow the <code>href</code> attribute contained in each <code>&lt;a></code> tag (contains the link to that movie’s page)</li>



<li>Use BeautifulSoup to find all <code>&lt;p></code> tags</li>



<li>Loop through <code>&lt;p></code> tags until I find one that contains the text “SEX/NUDITY”</li>



<li>Extract the text</li>
</ol>
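<p>Steps 2&#8211;4 of this plan can be rehearsed on a toy HTML snippet before touching the real site (the snippet below is invented; only the &#8220;SEX/NUDITY&#8221; marker text comes from the site):</p>

```python
from bs4 import BeautifulSoup

# Invented stand-in for a movie page
html = """
<p>Plot summary goes here.</p>
<p>SEX/NUDITY 4 - A man and a woman kiss briefly.</p>
<p>VIOLENCE/GORE 4 - Several fight scenes.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Loop through all <p> tags until one contains the marker text
sex_content = ""
for p in soup.find_all("p"):
    if "SEX/NUDITY" in p.text:
        sex_content = p.text
        break

print(sex_content)  # SEX/NUDITY 4 - A man and a woman kiss briefly.
```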



<p>Organize data and save it to a file:</p>



<ol class="wp-block-list">
<li>Build a <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> containing keys for each piece of information (title, year, rating, etc.)</li>



<li>Convert the dictionary to a pandas data frame</li>



<li>Write the data frame to a CSV file</li>
</ol>
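<p>The dictionary-to-data-frame step can be sketched as follows (the sample rows and this subset of column names are for illustration only):</p>

```python
import io
import pandas as pd

# Invented sample data; the dictionary keys become the column titles
movie_dict = {"title": ["Abandon", "Abominable"],
              "year": [2002, 2019],
              "mpaa": ["PG-13", "PG"]}

df_movies = pd.DataFrame(movie_dict)

# The real script writes "Movies.csv"; an in-memory buffer works the same way
buf = io.StringIO()
df_movies.to_csv(buf)

print(df_movies.shape)  # (2, 3)
```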



<h2 class="wp-block-heading">Scraping the Movie Titles and Ratings</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="721" height="926" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-115.png" alt="" class="wp-image-1194322" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-115.png 721w, https://blog.finxter.com/wp-content/uploads/2023/03/image-115-234x300.png 234w" sizes="auto, (max-width: 721px) 100vw, 721px" /></figure>
</div>


<p>The <code>import</code> statements needed for the code shown in this section are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin</pre>



<p>I decided to call the main function <code>scrape_kim_ratings()</code>, and I gave it an input of all of the letter pages I wanted to scrape. </p>



<p>Next, I initialized the dictionary containing all the movie information, which would be converted to a pandas data frame. </p>



<p>The dictionary keys become the data frame column titles after conversion:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_kim_ratings(letters):
   movie_dict = {"title": [],
                 "year": [],
                 "mpaa": [],
                 "KIM sex": [],
                 "KIM violence": [],
                 "KIM language": [],
                 "KIM sex content": []}
</pre>



<p>Next, I defined a for loop to loop through each letter page and pull the HTML from each page using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-get-the-ultimate-guide/" data-type="post" data-id="37837" target="_blank">requests.get()</a></code> method. Once I had the HTML, I used BeautifulSoup to find all <code>&lt;div></code> tags with an attribute <code>class = "et_pb_text_inner"</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">   for letter in letters:
       # Get a response from each letter page
       url = f"https://kids-in-mind.com/{letter}.htm"
       res = requests.get(url)


       if res:
           # Get the HTML from that page
           soup = BeautifulSoup(res.text, "html.parser")
           # The list of movies is in a div tag with class = et_pb_text_inner
           div = soup.findAll("div", class_="et_pb_text_inner")
</pre>



<p>As it turns out, the letter pages contained multiple tags matching these criteria, so I had to figure out which tag contained the list of movies. </p>



<p>You’ll see that I looped through each of the <code>div</code> tags (<code>for entry in div:</code>), used the <strong>bs4</strong> <code>getText()</code> method to pull the entry&#8217;s text, and looked to see if the text contained <strong>“Movie Reviews by Title.”</strong> </p>



<p>The next tag contained the <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a> of movies – I had figured this out by inspecting the HTML of a few of the letter pages. In the code below, <code>idx</code> is the index of the tag containing the list of movies: </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # Find the list of movies. It comes after "Movie Reviews by Title"
           idx = 0
           for entry in div:
               text = entry.getText()
               if "Movie Reviews by Title" in text:
                   idx += 1
                   break
               idx += 1
</pre>



<p>Next, I used the bs4 <code>getText()</code> method to get a string of all the text from the <code>&lt;div></code> tag with the list of movies. The object stored in <code>div[idx]</code> is an instance of the <code>bs4.element.Tag</code> class, which means we can think of it as a <code>&lt;div></code> tag that can be parsed and manipulated with bs4 functions and methods. </p>



<p>You can use Python’s <code><a href="https://blog.finxter.com/python-type/" data-type="post" data-id="23967" target="_blank" rel="noreferrer noopener">type()</a></code> function to determine this. I used the <code>type()</code> function heavily while I was figuring out how the bs4 functions worked and what their outputs were.</p>
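<p>For example, a quick <code>type()</code> check on a throwaway HTML string shows the bs4 types involved:</p>

```python
from bs4 import BeautifulSoup
import bs4.element

# Throwaway HTML string, just to inspect the types bs4 returns
soup = BeautifulSoup('<div class="et_pb_text_inner"><a href="/a/abandon.htm">Abandon</a></div>',
                     "html.parser")
div = soup.findAll("div", class_="et_pb_text_inner")

print(type(soup))    # <class 'bs4.BeautifulSoup'>
print(type(div))     # <class 'bs4.element.ResultSet'>
print(type(div[0]))  # <class 'bs4.element.Tag'>
```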



<p>All the movies were separated by newline characters, so I used the <code>split()</code> method to get a list containing a different movie in each entry:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # All movies on the page, separated by \n 
           # (movie names with ratings are stored as text of the div tag)
           movies = div[idx].getText().split("\n")
</pre>



<p>To be honest, at first, I didn’t know that all the movies were stored as text within the <code>&lt;div></code> tag. I thought I was going to have to pull the text from each <code>&lt;a></code> tag within the <code>&lt;div></code> tag. </p>



<p>However, using the PyCharm debugger to play around with <code>div[idx]</code>, I discovered that pulling the text from the <code>&lt;div></code> tag provided me with the movie information.</p>



<p>Next, I needed to get the links that would take me to each movie page. I used the <code>findAll()</code> method to get all <code>&lt;a></code> tags and then used the <code>urljoin()</code> function to join the URL of the current “letter” web page (like <a href="https://kids-in-mind.com/a.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/a.htm</a>) with the relative link to the movie page (like /a/abandon.htm). </p>



<p>An example result is <a href="https://kids-in-mind.com/a/abandon.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/a/abandon.htm</a>. I used <a href="https://blog.finxter.com/list-comprehension/" data-type="post" data-id="1171" target="_blank" rel="noreferrer noopener">list comprehension</a> to put them all in a list, <code>links</code>:</p>
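<p>You can verify the <code>urljoin()</code> behavior in isolation:</p>

```python
from urllib.parse import urljoin

# A site-root-relative href replaces the path of the base URL
print(urljoin("https://kids-in-mind.com/a.htm", "/a/abandon.htm"))
# https://kids-in-mind.com/a/abandon.htm

# An already-absolute href is returned unchanged
print(urljoin("https://kids-in-mind.com/a.htm", "https://example.com/x.htm"))
# https://example.com/x.htm
```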



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # href links to each movie page are stored in a tags
           a = div[idx].findAll("a")
           links = [urljoin(url, x["href"]) for x in a]
</pre>



<p>Now I had all of the movie rating information for a given letter page and all the links to the movie pages. The next steps were to:</p>



<ol class="wp-block-list">
<li>Parse each string in <code>movies</code> for each rating and other pieces of information</li>



<li>Follow each link in <code>links</code> and parse the sexual content</li>
</ol>



<p>To make it easier to loop through both lists at once, I used the <code>zip()</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # zip these up to make iteration easier in the for loop
           movies_and_links = list(zip(movies, links))
</pre>
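<p>One caveat with <code>zip()</code>: it stops at the end of the shorter input, so if the parsed movie list and the link list ever get out of sync, the extra entries are silently dropped (the entries below are invented):</p>

```python
movies = ["Abandon [2002] [PG-13] - 4.4.4", "Abominable [2019] [PG] - 1.4.1"]
links = ["https://kids-in-mind.com/a/abandon.htm"]  # one link missing

# zip() truncates to the shorter list
movies_and_links = list(zip(movies, links))
print(len(movies_and_links))  # 1 -- the second movie is silently dropped
```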



<p>Next, I looped through each <code>movie</code> and each <code>link</code>. First, I parsed the string in <code>movie</code> for the year, MPAA rating, Kids In Mind ratings, and the movie title using a function that I defined called <code>parse_movie()</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           for movie, link in movies_and_links:
               # get the information available in the list on each letter page
               year, mpaa, ratings, title = parse_movie(movie)
               print(f"Title is {title}")
</pre>



<p>This function took a bit of trial and error to write.</p>



<p>At first, I thought all of the strings were formatted like <code>"Abandon [2002] [PG-13] – 4.4.4"</code>. </p>



<p>However, after running the code once, I saw that some of the strings were formatted like this, <code>"Abandon [<em>Foreign Name</em>] [2002] [PG-13] – 4.4.4"</code>, with an additional set of brackets containing the film&#8217;s name in a different language. </p>



<p>I had to add the code block at the very beginning of the function to skip over this set of brackets.</p>



<p>You can see that the two main functions I used were the string methods <code><a href="https://blog.finxter.com/python-string-find/" data-type="post" data-id="26011" target="_blank" rel="noreferrer noopener">find()</a></code> (to find the brackets) and <code><a href="https://blog.finxter.com/python-string-split/" data-type="post" data-id="26097" target="_blank" rel="noreferrer noopener">split()</a></code> (to isolate the Kids In Mind ratings). </p>



<p>The last tricky bit that gave me trouble was that sometimes the Kids In Mind ratings were separated by an <a href="https://www.thepunctuationguide.com/en-dash.html" target="_blank" rel="noreferrer noopener">en dash</a> and other times by a plain hyphen: </p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def parse_movie(movie):
   # some entries had a foreign name in brackets
   if movie.count("]") > 2:
       start_idx = movie.find("]") + 1
   else:
       start_idx = 0

   # year is usually in the first set of brackets
   year_idx1 = movie.find("[", start_idx)
   year_idx2 = movie.find("]", start_idx)

   # mpaa rating was next
   mpaa_idx1 = movie.find("[", year_idx1 + 1)
   mpaa_idx2 = movie.find("]", year_idx2 + 1)

   year = int(movie[year_idx1 + 1:year_idx2].strip())
   mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]

   # the ratings came after a dash and were formatted like #.#.#
   ratings_split = movie.split("–")
   # sometimes they used an en dash, sometimes a plain hyphen
   if len(ratings_split) == 1:
       ratings_split = movie.split("-")

   ratings = [int(x) for x in ratings_split[-1].split(".")]

   title = movie[0:year_idx1]

   return year, mpaa, ratings, title
</pre>
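<p>To see the <code>find()</code> and <code>split()</code> mechanics in isolation, here they are applied to a sample entry in the site&#8217;s format (using a plain hyphen for simplicity):</p>

```python
movie = "Abandon [2002] [PG-13] - 4.4.4"

# The year sits in the first set of brackets
year = int(movie[movie.find("[") + 1 : movie.find("]")])

# The ratings trail the dash; split("-")[-1] grabs everything after the last hyphen
ratings = [int(x) for x in movie.split("-")[-1].split(".")]

print(year)     # 2002
print(ratings)  # [4, 4, 4]
```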



<h2 class="wp-block-heading">Scrape sexual content description</h2>



<p>The additional import statements needed for the code in this section are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import bs4.element
import random
</pre>



<p>After parsing <code>movie</code>, it was time to follow the link to the movie’s page and pull a more detailed description of sexual content using the function <code>scrape_kim_sexcontent()</code>. </p>



<p>Since this was going to require making many “<code>get</code>” requests to the Kids In Mind website, I also added a variable time delay in between each request using the <code><a href="https://blog.finxter.com/time-delay-in-python/" data-type="post" data-id="138154" target="_blank" rel="noreferrer noopener">time.sleep()</a></code> function. I did this for two reasons:</p>



<ol class="wp-block-list">
<li>It’s good practice to add some sort of delay between requests so that you do not overload the website’s server.</li>



<li>Adding a bit of random variation to the time delays can trick the web server into thinking your web scraping script is a human, making it less likely to reject subsequent requests.</li>
</ol>



<p>Code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># follow each movie link to get the sex content description
               start = time.time()
               sex_content = scrape_kim_sexcontent(link)
               delay = time.time() - start

               wait_time = random.uniform(.5, 2) * delay
               print(f'Just finished {title}')
               print(f'wait time is {wait_time}')
               time.sleep(wait_time)
</pre>



<p>Scraping the detailed descriptions proved a bit trickier than getting the <strong><em>Kids In Mind</em></strong> ratings. As I mentioned above, I planned to use the bs4 object method <code>findAll()</code> to get all of the <code>&lt;p></code> tags and find the one that contained sexual content.</p>



<p>Below is the first iteration of my <code>scrape_kim_sexcontent()</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_kim_sexcontent(url):
   # Request html from page and find all p tags
   res = requests.get(url)
   soup = BeautifulSoup(res.text, 'html.parser')
   res.close()
   p_set = soup.findAll("p")


   for entry in p_set:
       if 'SEX/NUDITY' in entry.text:
           sex_content = entry.text
           break

   return sex_content
</pre>



<p>However, I quickly realized that some of the movie pages were organized differently. The screenshot below shows a resulting CSV file. You can see that the script pulled a paragraph from the right side of the web page instead of the sexual content paragraph.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="876" height="311" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-109.png" alt="" class="wp-image-1194282" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-109.png 876w, https://blog.finxter.com/wp-content/uploads/2023/03/image-109-300x107.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-109-768x273.png 768w" sizes="auto, (max-width: 876px) 100vw, 876px" /></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="511" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-110-1024x511.png" alt="" class="wp-image-1194283" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-110-1024x511.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110-300x150.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110-768x384.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110-1536x767.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>It turns out that some of the movie pages, like the one for <em>Abominable</em>, had the title and text “SEX/NUDITY” in an <code>&lt;h2></code> tag preceding the <code>&lt;p></code> tag that contained the detailed description.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="419" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-111-1024x419.png" alt="" class="wp-image-1194284" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-111-1024x419.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111-300x123.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111-768x314.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111-1536x629.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>To handle this variation, I added some code. The final version of <code>scrape_kim_sexcontent()</code> is below. First, I looked for all of the <code>&lt;h2></code> tags. Then I looped through them until I found one with an id attribute equal to “sex”. I used the <code>bs4.element.Tag</code> attribute, <code>attrs</code>, to access each tag’s attributes as a dictionary.</p>



<p>If you take another look at the <em>Abominable</em> page HTML, you can see that the <code>&lt;p></code> tag containing the sexual content details is at the same level as the preceding <code>&lt;h2></code> tag rather than being nested within it.</p>



<p>This means that the <code>&lt;p></code> tag is a <em>sibling</em> of the <code>&lt;h2></code> tag, not its <em>child</em>. Thus, I was able to access it using the <code>bs4.element.Tag</code> attribute <code>next_siblings</code>, which returns an iterator over the siblings that follow the <code>&lt;h2></code> tag. </p>



<p>Finally, I used the <code>bs4.element.Tag</code> attribute <code>text</code> to get the paragraph I wanted:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_kim_sexcontent(url):
   # Request html from page and find all h2 tags
   res = requests.get(url)
   soup = BeautifulSoup(res.text, 'html.parser')
   res.close()
   h2_set = soup.findAll("h2")

   # Initialize
   sex_content = ""

   # Check the &lt;h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
   sibling_iter = []
   for entry in h2_set:
       if "id" in entry.attrs:
           if entry["id"] == "sex":
               sibling_iter = entry.next_siblings

               # Grab the next paragraph
                for sibling in sibling_iter:
                    if type(sibling) == bs4.element.Tag:
                        # stop at the first tag sibling: the description paragraph
                        sex_content = sibling.text
                        break

   # Sometimes header &lt;h2> tags aren't used to make the paragraph headers
   # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
   if sex_content == "":
       p_set = soup.findAll("p")

       for entry in p_set:
           if 'SEX/NUDITY' in entry.text:
               sex_content = entry.text
               break

   return sex_content
</pre>



<h2 class="wp-block-heading">Organize Data and Save it to a File</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="974" height="650" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-116.png" alt="" class="wp-image-1194326" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-116.png 974w, https://blog.finxter.com/wp-content/uploads/2023/03/image-116-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-116-768x513.png 768w" sizes="auto, (max-width: 974px) 100vw, 974px" /></figure>
</div>


<p>The additional import statements needed for the code in this section are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import string
</pre>



<p>Finally, it was time to organize the scraped data and save it to a CSV file. </p>



<p>I decided to use the <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">pandas library</a> since its <code>to_csv</code> data frame method makes it super easy to save data to a CSV file. </p>



<p>First, after parsing the information for each movie, I saved each piece of data in a dictionary. After each “letter” page was completed, I converted the growing dictionary to a pandas data frame using the <code>pd.DataFrame()</code> constructor and then saved the resulting data frame to a CSV file. </p>



<p>I decided to write to the CSV file after each “letter” page was completed to make sure that I would have data saved if the web scraping script was interrupted for some reason: </p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">               # Build dictionary for conversion to data frame
               movie_dict["title"].append(title)
               movie_dict["year"].append(year)
               movie_dict["mpaa"].append(mpaa)
               movie_dict["KIM sex"].append(ratings[0])
               movie_dict["KIM violence"].append(ratings[1])
               movie_dict["KIM language"].append(ratings[2])
               movie_dict["KIM sex content"].append(sex_content)

           res.close()

           # Write to the CSV after every letter
           print("\n")
           print("Writing to Movies.csv")
           df_movies = pd.DataFrame(movie_dict)
           df_movies.to_csv("Movies.csv")

           print(f"Done with {letter}. Waiting {wait_time} seconds")
           time.sleep(wait_time)

       else:
           print(f"Error: {res}")

   return df_movies
</pre>



<p>Lastly, I called the main function <code>scrape_kim_ratings()</code> and provided a list of all the <a href="https://blog.finxter.com/how-to-lowercase-a-string-in-python/" data-type="post" data-id="30102" target="_blank" rel="noreferrer noopener">lowercase</a> ASCII letters:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_movies = scrape_kim_ratings(string.ascii_lowercase)</pre>



<h2 class="wp-block-heading">Conclusion</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="969" height="643" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-117.png" alt="" class="wp-image-1194328" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-117.png 969w, https://blog.finxter.com/wp-content/uploads/2023/03/image-117-300x199.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-117-768x510.png 768w" sizes="auto, (max-width: 969px) 100vw, 969px" /></figure>



<p>So, there you have it! Here is a link to the GitHub page with the full script: <a rel="noreferrer noopener" href="https://github.com/finxter/WebScrapeKidsMovies" data-type="URL" data-id="https://github.com/finxter/WebScrapeKidsMovies" target="_blank">https://github.com/finxter/WebScrapeKidsMovies</a>. I&#8217;ll also attach it at the end of this article.</p>



<p>In the future, I think I will add functions to the script that pull information from other websites and add it to the current database. I might also add a function that checks the websites for any new movies/ratings and adds them to the current database.</p>



<p>I hope this will inspire you to write your own web scraping script!</p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/basketball-statistics-page-scraping-using-python-and-beautifulsoup/" data-type="URL" data-id="https://blog.finxter.com/basketball-statistics-page-scraping-using-python-and-beautifulsoup/" target="_blank" rel="noreferrer noopener">Basketball Statistics – Page Scraping Using Python and BeautifulSoup</a></p>



<h2 class="wp-block-heading">The Script</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import requests
from bs4 import BeautifulSoup
import bs4.element
import string
import time
from urllib.parse import urljoin
import random


def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")

    # Initialize
    sex_content = ""

    # Check the &lt;h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings

                # Grab the next paragraph
                for sibling in sibling_iter:
                    if type(sibling) == bs4.element.Tag:
                        # stop at the first tag sibling: the description paragraph
                        sex_content = sibling.text
                        break

    # Sometimes header &lt;h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")

        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break

    return sex_content


def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0

    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)

    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)

    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]

    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used an en dash, sometimes a plain hyphen
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")

    ratings = [int(x) for x in ratings_split[-1].split(".")]

    title = movie[0:year_idx1]

    return year, mpaa, ratings, title


def scrape_kim_ratings(letters):
    movie_dict = {"title": [],
                  "year": [],
                  "mpaa": [],
                  "KIM sex": [],
                  "KIM violence": [],
                  "KIM language": [],
                  "KIM sex content": []}

    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)

        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")

            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1

            # All movies on the page, separated by \n (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")

            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]

            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))

            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")

                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start

                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)

                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)

            res.close()

            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")

            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)

        else:
            print(f"Error: {res}")

    return df_movies


df_movies = scrape_kim_ratings(string.ascii_lowercase)</pre>
<p>The post <a href="https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/">I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python &#8211; How to Convert KML to CSV?</title>
		<link>https://blog.finxter.com/python-how-to-convert-kml-to-csv/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Thu, 18 Aug 2022 15:19:59 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[Input/Output]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[XML]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=587850</guid>

					<description><![CDATA[<p>What is KML? ℹ️ Definition: The Keyhole Markup Language (KML) is a file format for displaying geographic data in Google Earth or other so-called &#8220;Earth Browsers&#8221;. Similarly to XML, KML uses a tag-based structure with nested elements and attributes. How to Convert KML to CSV in Python? You can convert a .kml to a .csv ... <a title="Python &#8211; How to Convert KML to CSV?" class="read-more" href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/" aria-label="Read more about Python &#8211; How to Convert KML to CSV?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/">Python &#8211; How to Convert KML to CSV?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">What is KML?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="352" height="263" src="https://blog.finxter.com/wp-content/uploads/2022/08/image-45.png" alt="" class="wp-image-587952" srcset="https://blog.finxter.com/wp-content/uploads/2022/08/image-45.png 352w, https://blog.finxter.com/wp-content/uploads/2022/08/image-45-300x224.png 300w" sizes="auto, (max-width: 352px) 100vw, 352px" /></figure>
</div>


<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2139.png" alt="ℹ" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Definition</strong>: The <a href="https://developers.google.com/kml/documentation/kml_tut" data-type="URL" data-id="https://developers.google.com/kml/documentation/kml_tut" target="_blank" rel="noreferrer noopener">Keyhole Markup Language</a> (KML) is a file format for displaying geographic data in Google Earth or other so-called &#8220;Earth Browsers&#8221;. Similarly to XML, KML uses a tag-based structure with nested elements and attributes. </p>



<h2 class="wp-block-heading">How to Convert KML to CSV in Python?</h2>



<p class="has-global-color-8-background-color has-background">You can convert a <code>.kml</code> to a <code>.csv</code> file in Python by using the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/" data-type="post" data-id="474965" target="_blank">BeautifulSoup</a> and the <code>csv</code> libraries. You use the former to read the XML-structured KML file and the latter to write the CSV file row by row. </p>



<p>Here&#8217;s a code example, inspired by and modified from <a href="https://gist.github.com/mciantyre/32ff2c2d5cd9515c1ee7" data-type="URL" data-id="https://gist.github.com/mciantyre/32ff2c2d5cd9515c1ee7" target="_blank" rel="noreferrer noopener">this</a> GitHub gist. Copy and paste it into the directory where your KML file resides and change the input and output filenames at the top to convert your own KML file to a CSV in Python:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import csv


infile = 'my_file.kml'
outfile = 'my_file.csv'


with open(infile, 'r') as f:
    s = BeautifulSoup(f, 'xml')
    
    # Python 3: open in text mode with newline='' so the csv module controls line endings
    with open(outfile, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)

        for coords in s.find_all('coordinates'):
            
            # Take coordinate string from KML and break it up into [Lat,Lon,Lat,Lon...] to get CSV row
            space_splits = coords.string.split(" ")
            row = []
            
            for split in space_splits[1:]:
                # Note: the string inside &lt;coordinates> starts with a space, so the
                # first element of space_splits is empty and we skip it with [1:]
                comma_split = split.split(',')

                # latitude
                row.append(comma_split[1])
                
                # longitude
                row.append(comma_split[0])
            
            writer.writerow(row)
</pre>
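<p>As a side note, the <code>[1:]</code> slice in the loop above is only needed because splitting on a single space character produces an empty first element. Calling <code>str.split()</code> with no arguments splits on any whitespace run and drops empty strings automatically. Here&#8217;s a minimal, self-contained sketch of just the coordinate-parsing step (the coordinate string and the output filename <code>coords.csv</code> are made up for illustration):</p>

```python
import csv

# Hypothetical coordinate string as it appears inside a KML coordinates tag:
# a leading space, then whitespace-separated "lon,lat,alt" triples
coord_string = ' -112.0814237830345,36.10677870477137,0 -112.0870267752693,36.0905099328766,0 '

row = []
for token in coord_string.split():    # split() without args drops empty strings
    lon, lat = token.split(',')[:2]   # ignore the altitude component
    row.extend([lat, lon])            # latitude first, matching the script above

# write the flattened lat/lon values as one CSV row
with open('coords.csv', 'w', newline='') as f:
    csv.writer(f).writerow(row)

print(row)
# ['36.10677870477137', '-112.0814237830345', '36.0905099328766', '-112.0870267752693']
```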



<h2 class="wp-block-heading">Example Conversion</h2>



<p>We use the following <a href="https://developers.google.com/kml/documentation/kml_tut" data-type="URL" data-id="https://developers.google.com/kml/documentation/kml_tut" target="_blank" rel="noreferrer noopener">sample</a> KML file as <code>'my_file.kml'</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;kml xmlns="http://www.opengis.net/kml/2.2">
  &lt;Document>
    &lt;name>KML Samples&lt;/name>
    &lt;open>1&lt;/open>
    &lt;description>Unleash your creativity with the help of these examples!&lt;/description>
    &lt;Style id="downArrowIcon">
      &lt;IconStyle>
        &lt;Icon>
          &lt;href>http://maps.google.com/mapfiles/kml/pal4/icon28.png&lt;/href>
        &lt;/Icon>
      &lt;/IconStyle>
    &lt;/Style>
    &lt;Style id="globeIcon">
      &lt;IconStyle>
        &lt;Icon>
          &lt;href>http://maps.google.com/mapfiles/kml/pal3/icon19.png&lt;/href>
        &lt;/Icon>
      &lt;/IconStyle>
      &lt;LineStyle>
        &lt;width>2&lt;/width>
      &lt;/LineStyle>
    &lt;/Style>
    &lt;Style id="transPurpleLineGreenPoly">
      &lt;LineStyle>
        &lt;color>7fff00ff&lt;/color>
        &lt;width>4&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7f00ff00&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="yellowLineGreenPoly">
      &lt;LineStyle>
        &lt;color>7f00ffff&lt;/color>
        &lt;width>4&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7f00ff00&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="thickBlackLine">
      &lt;LineStyle>
        &lt;color>87000000&lt;/color>
        &lt;width>10&lt;/width>
      &lt;/LineStyle>
    &lt;/Style>
    &lt;Style id="redLineBluePoly">
      &lt;LineStyle>
        &lt;color>ff0000ff&lt;/color>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>ffff0000&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="blueLineRedPoly">
      &lt;LineStyle>
        &lt;color>ffff0000&lt;/color>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>ff0000ff&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transRedPoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7d0000ff&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transBluePoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7dff0000&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transGreenPoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7d00ff00&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transYellowPoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7d00ffff&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="noDrivingDirections">
      &lt;BalloonStyle>
        &lt;text>&lt;![CDATA[
          &lt;b>$[name]&lt;/b>
          &lt;br />&lt;br />
          $[description]
        ]]&gt;&lt;/text>
      &lt;/BalloonStyle>
    &lt;/Style>
    &lt;Folder>
      &lt;name>Placemarks&lt;/name>
      &lt;description>These are just some of the different kinds of placemarks with
        which you can mark your favorite places&lt;/description>
      &lt;LookAt>
        &lt;longitude>-122.0839597145766&lt;/longitude>
        &lt;latitude>37.42222904525232&lt;/latitude>
        &lt;altitude>0&lt;/altitude>
        &lt;heading>-148.4122922628044&lt;/heading>
        &lt;tilt>40.5575073395506&lt;/tilt>
        &lt;range>500.6566641072245&lt;/range>
      &lt;/LookAt>
      &lt;Placemark>
        &lt;name>Simple placemark&lt;/name>
        &lt;description>Attached to the ground. Intelligently places itself at the
          height of the underlying terrain.&lt;/description>
        &lt;Point>
          &lt;coordinates>-122.0822035425683,37.42228990140251,0&lt;/coordinates>
        &lt;/Point>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Floating placemark&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Floats a defined distance above the ground.&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.0839597145766&lt;/longitude>
          &lt;latitude>37.42222904525232&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-148.4122922628044&lt;/heading>
          &lt;tilt>40.5575073395506&lt;/tilt>
          &lt;range>500.6566641072245&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#downArrowIcon&lt;/styleUrl>
        &lt;Point>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates>-122.084075,37.4220033612141,50&lt;/coordinates>
        &lt;/Point>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Extruded placemark&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Tethered to the ground by a customizable
          &amp;quot;tail&amp;quot;&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.0845787421525&lt;/longitude>
          &lt;latitude>37.42215078737763&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-148.4126684946234&lt;/heading>
          &lt;tilt>40.55750733918048&lt;/tilt>
          &lt;range>365.2646606980322&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#globeIcon&lt;/styleUrl>
        &lt;Point>
          &lt;extrude>1&lt;/extrude>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates>-122.0857667006183,37.42156927867553,50&lt;/coordinates>
        &lt;/Point>
      &lt;/Placemark>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Styles and Markup&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>With KML it is easy to create rich, descriptive markup to
        annotate and enrich your placemarks&lt;/description>
      &lt;LookAt>
        &lt;longitude>-122.0845787422371&lt;/longitude>
        &lt;latitude>37.42215078726837&lt;/latitude>
        &lt;altitude>0&lt;/altitude>
        &lt;heading>-148.4126777488172&lt;/heading>
        &lt;tilt>40.55750733930874&lt;/tilt>
        &lt;range>365.2646826292919&lt;/range>
      &lt;/LookAt>
      &lt;styleUrl>#noDrivingDirections&lt;/styleUrl>
      &lt;Document>
        &lt;name>Highlighted Icon&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Place your mouse over the icon to see it display the new
          icon&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.0856552124024&lt;/longitude>
          &lt;latitude>37.4224281311035&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>0&lt;/heading>
          &lt;tilt>0&lt;/tilt>
          &lt;range>265.8520424250024&lt;/range>
        &lt;/LookAt>
        &lt;Style id="highlightPlacemark">
          &lt;IconStyle>
            &lt;Icon>
              &lt;href>http://maps.google.com/mapfiles/kml/paddle/red-stars.png&lt;/href>
            &lt;/Icon>
          &lt;/IconStyle>
        &lt;/Style>
        &lt;Style id="normalPlacemark">
          &lt;IconStyle>
            &lt;Icon>
              &lt;href>http://maps.google.com/mapfiles/kml/paddle/wht-blank.png&lt;/href>
            &lt;/Icon>
          &lt;/IconStyle>
        &lt;/Style>
        &lt;StyleMap id="exampleStyleMap">
          &lt;Pair>
            &lt;key>normal&lt;/key>
            &lt;styleUrl>#normalPlacemark&lt;/styleUrl>
          &lt;/Pair>
          &lt;Pair>
            &lt;key>highlight&lt;/key>
            &lt;styleUrl>#highlightPlacemark&lt;/styleUrl>
          &lt;/Pair>
        &lt;/StyleMap>
        &lt;Placemark>
          &lt;name>Roll over this icon&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#exampleStyleMap&lt;/styleUrl>
          &lt;Point>
            &lt;coordinates>-122.0856545755255,37.42243077405461,0&lt;/coordinates>
          &lt;/Point>
        &lt;/Placemark>
      &lt;/Document>
      &lt;Placemark>
        &lt;name>Descriptive HTML&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>&lt;![CDATA[Click on the blue link!&lt;br>&lt;br>
Placemark descriptions can be enriched by using many standard HTML tags.&lt;br>
For example:
&lt;hr>
Styles:&lt;br>
&lt;i>Italics&lt;/i>, 
&lt;b>Bold&lt;/b>, 
&lt;u>Underlined&lt;/u>, 
&lt;s>Strike Out&lt;/s>, 
subscript&lt;sub>subscript&lt;/sub>, 
superscript&lt;sup>superscript&lt;/sup>, 
&lt;big>Big&lt;/big>, 
&lt;small>Small&lt;/small>, 
&lt;tt>Typewriter&lt;/tt>, 
&lt;em>Emphasized&lt;/em>, 
&lt;strong>Strong&lt;/strong>, 
&lt;code>Code&lt;/code>
&lt;hr>
Fonts:&lt;br> 
&lt;font color="red">red by name&lt;/font>, 
&lt;font color="#408010">leaf green by hexadecimal RGB&lt;/font>
&lt;br>
&lt;font size=1>size 1&lt;/font>, 
&lt;font size=2>size 2&lt;/font>, 
&lt;font size=3>size 3&lt;/font>, 
&lt;font size=4>size 4&lt;/font>, 
&lt;font size=5>size 5&lt;/font>, 
&lt;font size=6>size 6&lt;/font>, 
&lt;font size=7>size 7&lt;/font>
&lt;br>
&lt;font face=times>Times&lt;/font>, 
&lt;font face=verdana>Verdana&lt;/font>, 
&lt;font face=arial>Arial&lt;/font>&lt;br>
&lt;hr>
Links: 
&lt;br>
&lt;a href="http://earth.google.com/">Google Earth!&lt;/a>
&lt;br>
 or:  Check out our website at www.google.com
&lt;hr>
Alignment:&lt;br>
&lt;p align=left>left&lt;/p>
&lt;p align=center>center&lt;/p>
&lt;p align=right>right&lt;/p>
&lt;hr>
Ordered Lists:&lt;br>
&lt;ol>&lt;li>First&lt;/li>&lt;li>Second&lt;/li>&lt;li>Third&lt;/li>&lt;/ol>
&lt;ol type="a">&lt;li>First&lt;/li>&lt;li>Second&lt;/li>&lt;li>Third&lt;/li>&lt;/ol>
&lt;ol type="A">&lt;li>First&lt;/li>&lt;li>Second&lt;/li>&lt;li>Third&lt;/li>&lt;/ol>
&lt;hr>
Unordered Lists:&lt;br>
&lt;ul>&lt;li>A&lt;/li>&lt;li>B&lt;/li>&lt;li>C&lt;/li>&lt;/ul>
&lt;ul type="circle">&lt;li>A&lt;/li>&lt;li>B&lt;/li>&lt;li>C&lt;/li>&lt;/ul>
&lt;ul type="square">&lt;li>A&lt;/li>&lt;li>B&lt;/li>&lt;li>C&lt;/li>&lt;/ul>
&lt;hr>
Definitions:&lt;br>
&lt;dl>
&lt;dt>Google:&lt;/dt>&lt;dd>The best thing since sliced bread&lt;/dd>
&lt;/dl>
&lt;hr>
Centered:&lt;br>&lt;center>
Time present and time past&lt;br>
Are both perhaps present in time future,&lt;br>
And time future contained in time past.&lt;br>
If all time is eternally present&lt;br>
All time is unredeemable.&lt;br>
&lt;/center>
&lt;hr>
Block Quote:
&lt;br>
&lt;blockquote>
We shall not cease from exploration&lt;br>
And the end of all our exploring&lt;br>
Will be to arrive where we started&lt;br>
And know the place for the first time.&lt;br>
&lt;i>-- T.S. Eliot&lt;/i>
&lt;/blockquote>
&lt;br>
&lt;hr>
Headings:&lt;br>
&lt;h1>Header 1&lt;/h1>
&lt;h2>Header 2&lt;/h2>
&lt;h3>Header 3&lt;/h3>
&lt;h3>Header 4&lt;/h4>
&lt;h3>Header 5&lt;/h5>
&lt;hr>
Images:&lt;br>
&lt;i>Remote image&lt;/i>&lt;br>
&lt;img src="//developers.google.com/kml/documentation/images/googleSample.png">&lt;br>
&lt;i>Scaled image&lt;/i>&lt;br>
&lt;img src="//developers.google.com/kml/documentation/images/googleSample.png" width=100>&lt;br>
&lt;hr>
Simple Tables:&lt;br>
&lt;table border="1" padding="1">
&lt;tr>&lt;td>1&lt;/td>&lt;td>2&lt;/td>&lt;td>3&lt;/td>&lt;td>4&lt;/td>&lt;td>5&lt;/td>&lt;/tr>
&lt;tr>&lt;td>a&lt;/td>&lt;td>b&lt;/td>&lt;td>c&lt;/td>&lt;td>d&lt;/td>&lt;td>e&lt;/td>&lt;/tr>
&lt;/table>
&lt;br>
[Did you notice that double-clicking on the placemark doesn't cause the viewer to take you anywhere? This is because it is possible to directly author a "placeless placemark". If you look at the code for this example, you will see that it has neither a point coordinate nor a LookAt element.]]]&gt;&lt;/description>
      &lt;/Placemark>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Ground Overlays&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Examples of ground overlays&lt;/description>
      &lt;GroundOverlay>
        &lt;name>Large-scale overlay on terrain&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Overlay shows Mount Etna erupting on July 13th, 2001.&lt;/description>
        &lt;LookAt>
          &lt;longitude>15.02468937557116&lt;/longitude>
          &lt;latitude>37.67395167941667&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-16.5581842842829&lt;/heading>
          &lt;tilt>58.31228652890705&lt;/tilt>
          &lt;range>30350.36838438907&lt;/range>
        &lt;/LookAt>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/etna.jpg&lt;/href>
        &lt;/Icon>
        &lt;LatLonBox>
          &lt;north>37.91904192681665&lt;/north>
          &lt;south>37.46543388598137&lt;/south>
          &lt;east>15.35832653742206&lt;/east>
          &lt;west>14.60128369746704&lt;/west>
          &lt;rotation>-0.1556640799496235&lt;/rotation>
        &lt;/LatLonBox>
      &lt;/GroundOverlay>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Screen Overlays&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Screen overlays have to be authored directly in KML. These
        examples illustrate absolute and dynamic positioning in screen space.&lt;/description>
      &lt;ScreenOverlay>
        &lt;name>Simple crosshairs&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>This screen overlay uses fractional positioning to put the
          image in the exact center of the screen&lt;/description>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/crosshairs.png&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0.5" y="0.5" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0.5" y="0.5" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0.5" y="0.5" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="pixels" yunits="pixels"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Top left&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/top_left.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Top right&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/top_right.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Bottom left&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/bottom_left.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0" y="-1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Bottom right&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/bottom_right.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="1" y="-1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="1" y="0" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Dynamic Positioning: Top of screen&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/dynamic_screenoverlay.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="1" y="0.2" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Dynamic Positioning: Right of screen&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/dynamic_right.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="1" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Paths&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Examples of paths. Note that the tessellate tag is by default
        set to 0. If you want to create tessellated lines, they must be authored
        (or edited) directly in KML.&lt;/description>
      &lt;Placemark>
        &lt;name>Tessellated&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>&lt;![CDATA[If the &lt;tessellate> tag has a value of 1, the line will contour to the underlying terrain]]&gt;&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.0822680013139&lt;/longitude>
          &lt;latitude>36.09825589333556&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>103.8120432044965&lt;/heading>
          &lt;tilt>62.04855796276328&lt;/tilt>
          &lt;range>2889.145007690472&lt;/range>
        &lt;/LookAt>
        &lt;LineString>
          &lt;tessellate>1&lt;/tessellate>
          &lt;coordinates> -112.0814237830345,36.10677870477137,0
            -112.0870267752693,36.0905099328766,0 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Untessellated&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>&lt;![CDATA[If the &lt;tessellate> tag has a value of 0, the line follow a simple straight-line path from point to point]]&gt;&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.0822680013139&lt;/longitude>
          &lt;latitude>36.09825589333556&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>103.8120432044965&lt;/heading>
          &lt;tilt>62.04855796276328&lt;/tilt>
          &lt;range>2889.145007690472&lt;/range>
        &lt;/LookAt>
        &lt;LineString>
          &lt;tessellate>0&lt;/tessellate>
          &lt;coordinates> -112.080622229595,36.10673460007995,0
            -112.085242575315,36.09049598612422,0 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Absolute&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Transparent purple line&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2719329043177&lt;/longitude>
          &lt;latitude>36.08890633450894&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-106.8161545998597&lt;/heading>
          &lt;tilt>44.60763714063257&lt;/tilt>
          &lt;range>2569.386744398339&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#transPurpleLineGreenPoly&lt;/styleUrl>
        &lt;LineString>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>absolute&lt;/altitudeMode>
          &lt;coordinates> -112.265654928602,36.09447672602546,2357
            -112.2660384528238,36.09342608838671,2357
            -112.2668139013453,36.09251058776881,2357
            -112.2677826834445,36.09189827357996,2357
            -112.2688557510952,36.0913137941187,2357
            -112.2694810717219,36.0903677207521,2357
            -112.2695268555611,36.08932171487285,2357
            -112.2690144567276,36.08850916060472,2357
            -112.2681528815339,36.08753813597956,2357
            -112.2670588176031,36.08682685262568,2357
            -112.2657374587321,36.08646312301303,2357 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Absolute Extruded&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Transparent green wall with yellow outlines&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2643334742529&lt;/longitude>
          &lt;latitude>36.08563154742419&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-125.7518698668815&lt;/heading>
          &lt;tilt>44.61038665812578&lt;/tilt>
          &lt;range>4451.842204068102&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#yellowLineGreenPoly&lt;/styleUrl>
        &lt;LineString>
          &lt;extrude>1&lt;/extrude>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>absolute&lt;/altitudeMode>
          &lt;coordinates> -112.2550785337791,36.07954952145647,2357
            -112.2549277039738,36.08117083492122,2357
            -112.2552505069063,36.08260761307279,2357
            -112.2564540158376,36.08395660588506,2357
            -112.2580238976449,36.08511401044813,2357
            -112.2595218489022,36.08584355239394,2357
            -112.2608216347552,36.08612634548589,2357
            -112.262073428656,36.08626019085147,2357
            -112.2633204928495,36.08621519860091,2357
            -112.2644963846444,36.08627897945274,2357
            -112.2656969554589,36.08649599090644,2357 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Relative&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Black line (10 pixels wide), height tracks terrain&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2580438551384&lt;/longitude>
          &lt;latitude>36.1072674824385&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>4.947421249553717&lt;/heading>
          &lt;tilt>44.61324882043339&lt;/tilt>
          &lt;range>2927.61105910266&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#thickBlackLine&lt;/styleUrl>
        &lt;LineString>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates> -112.2532845153347,36.09886943729116,645
            -112.2540466121145,36.09919570465255,645
            -112.254734666947,36.09984998366178,645
            -112.255493345654,36.10051310621746,645
            -112.2563157098468,36.10108441943419,645
            -112.2568033076439,36.10159722088088,645
            -112.257494011321,36.10204323542867,645
            -112.2584106072308,36.10229131995655,645
            -112.2596588987972,36.10240001286358,645
            -112.2610581199487,36.10213176873407,645
            -112.2626285262793,36.10157011437219,645 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Relative Extruded&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Opaque blue walls with red outline, height tracks terrain&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2683594333433&lt;/longitude>
          &lt;latitude>36.09884362144909&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-72.24271551768405&lt;/heading>
          &lt;tilt>44.60855445139561&lt;/tilt>
          &lt;range>2184.193522571467&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#redLineBluePoly&lt;/styleUrl>
        &lt;LineString>
          &lt;extrude>1&lt;/extrude>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates> -112.2656634181359,36.09445214722695,630
            -112.2652238941097,36.09520916122063,630
            -112.2645079986395,36.09580763864907,630
            -112.2638827428817,36.09628572284063,630
            -112.2635746835406,36.09679275951239,630
            -112.2635711822407,36.09740038871899,630
            -112.2640296531825,36.09804913435539,630
            -112.264327720538,36.09880337400301,630
            -112.2642436562271,36.09963644790288,630
            -112.2639148687042,36.10055381117246,630
            -112.2626894973474,36.10149062823369,630 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Polygons&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Examples of polygon shapes&lt;/description>
      &lt;Folder>
        &lt;name>Google Campus&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>A collection showing how easy it is to create 3-dimensional
          buildings&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.084120030116&lt;/longitude>
          &lt;latitude>37.42174011925477&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-34.82469740081282&lt;/heading>
          &lt;tilt>53.454348562403&lt;/tilt>
          &lt;range>276.7870053764046&lt;/range>
        &lt;/LookAt>
        &lt;Placemark>
          &lt;name>Building 40&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transRedPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0848938459612,37.42257124044786,17
                  -122.0849580979198,37.42211922626856,17
                  -122.0847469573047,37.42207183952619,17
                  -122.0845725380962,37.42209006729676,17
                  -122.0845954886723,37.42215932700895,17
                  -122.0838521118269,37.42227278564371,17
                  -122.083792243335,37.42203539112084,17
                  -122.0835076656616,37.42209006957106,17
                  -122.0834709464152,37.42200987395161,17
                  -122.0831221085748,37.4221046494946,17
                  -122.0829247374572,37.42226503990386,17
                  -122.0829339169385,37.42231242843094,17
                  -122.0833837359737,37.42225046087618,17
                  -122.0833607854248,37.42234159228745,17
                  -122.0834204551642,37.42237075460644,17
                  -122.083659133885,37.42251292011001,17
                  -122.0839758438952,37.42265873093781,17
                  -122.0842374743331,37.42265143972521,17
                  -122.0845036949503,37.4226514386435,17
                  -122.0848020460801,37.42261133916315,17
                  -122.0847882750515,37.42256395055121,17
                  -122.0848938459612,37.42257124044786,17 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Building 41&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transBluePoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0857412771483,37.42227033155257,17
                  -122.0858169768481,37.42231408832346,17
                  -122.085852582875,37.42230337469744,17
                  -122.0858799945639,37.42225686138789,17
                  -122.0858860101409,37.4222311076138,17
                  -122.0858069157288,37.42220250173855,17
                  -122.0858379542653,37.42214027058678,17
                  -122.0856732640519,37.42208690214408,17
                  -122.0856022926407,37.42214885429042,17
                  -122.0855902778436,37.422128290487,17
                  -122.0855841672237,37.42208171967246,17
                  -122.0854852065741,37.42210455874995,17
                  -122.0855067264352,37.42214267949824,17
                  -122.0854430712915,37.42212783846172,17
                  -122.0850990714904,37.42251282407603,17
                  -122.0856769818632,37.42281815323651,17
                  -122.0860162273783,37.42244918858722,17
                  -122.0857260327004,37.42229239604253,17
                  -122.0857412771483,37.42227033155257,17 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Building 42&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transGreenPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0857862287242,37.42136208886969,25
                  -122.0857312990603,37.42136935989481,25
                  -122.0857312992918,37.42140934910903,25
                  -122.0856077073679,37.42138390166565,25
                  -122.0855802426516,37.42137299550869,25
                  -122.0852186221971,37.42137299504316,25
                  -122.0852277765639,37.42161656508265,25
                  -122.0852598189347,37.42160565894403,25
                  -122.0852598185499,37.42168200156,25
                  -122.0852369311478,37.42170017860346,25
                  -122.0852643957828,37.42176197982575,25
                  -122.0853239032746,37.42176198013907,25
                  -122.0853559454324,37.421852864452,25
                  -122.0854108752463,37.42188921823734,25
                  -122.0854795379357,37.42189285337048,25
                  -122.0855436229819,37.42188921797546,25
                  -122.0856260178042,37.42186013499926,25
                  -122.085937287963,37.42186013453605,25
                  -122.0859428718666,37.42160898590042,25
                  -122.0859655469861,37.42157992759144,25
                  -122.0858640462341,37.42147115002957,25
                  -122.0858548911215,37.42140571326184,25
                  -122.0858091162768,37.4214057134039,25
                  -122.0857862287242,37.42136208886969,25 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Building 43&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transYellowPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0844371128284,37.42177253003091,19
                  -122.0845118855746,37.42191111542896,19
                  -122.0850470999805,37.42178755121535,19
                  -122.0850719913391,37.42143663023161,19
                  -122.084916406232,37.42137237822116,19
                  -122.0842193868167,37.42137237801626,19
                  -122.08421938659,37.42147617161496,19
                  -122.0838086419991,37.4214613409357,19
                  -122.0837899728564,37.42131306410796,19
                  -122.0832796534698,37.42129328840593,19
                  -122.0832609819207,37.42139213944298,19
                  -122.0829373621737,37.42137236399876,19
                  -122.0829062425667,37.42151569778871,19
                  -122.0828502269665,37.42176282576465,19
                  -122.0829435788635,37.42176776969635,19
                  -122.083217411188,37.42179248552686,19
                  -122.0835970430103,37.4217480074456,19
                  -122.0839455556771,37.42169364237603,19
                  -122.0840077894637,37.42176283815853,19
                  -122.084113587521,37.42174801104392,19
                  -122.0840762473784,37.42171341292375,19
                  -122.0841447047739,37.42167881534569,19
                  -122.084144704223,37.42181720660197,19
                  -122.0842503333074,37.4218170700446,19
                  -122.0844371128284,37.42177253003091,19 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
      &lt;/Folder>
      &lt;Folder>
        &lt;name>Extruded Polygon&lt;/name>
        &lt;description>A simple way to model a building&lt;/description>
        &lt;Placemark>
          &lt;name>The Pentagon&lt;/name>
          &lt;LookAt>
            &lt;longitude>-77.05580139178142&lt;/longitude>
            &lt;latitude>38.870832443487&lt;/latitude>
            &lt;heading>59.88865561738225&lt;/heading>
            &lt;tilt>48.09646074797388&lt;/tilt>
            &lt;range>742.0552506670548&lt;/range>
          &lt;/LookAt>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -77.05788457660967,38.87253259892824,100
                  -77.05465973756702,38.87291016281703,100
                  -77.05315536854791,38.87053267794386,100
                  -77.05552622493516,38.868757801256,100
                  -77.05844056290393,38.86996206506943,100
                  -77.05788457660967,38.87253259892824,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
            &lt;innerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -77.05668055019126,38.87154239798456,100
                  -77.05542625960818,38.87167890344077,100
                  -77.05485125901024,38.87076535397792,100
                  -77.05577677433152,38.87008686581446,100
                  -77.05691162017543,38.87054446963351,100
                  -77.05668055019126,38.87154239798456,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/innerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
      &lt;/Folder>
      &lt;Folder>
        &lt;name>Absolute and Relative&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Four structures whose roofs meet exactly. Turn on/off
          terrain to see the difference between relative and absolute
          positioning.&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.3348969157552&lt;/longitude>
          &lt;latitude>36.14845533214919&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-86.91235037566909&lt;/heading>
          &lt;tilt>49.30695423894192&lt;/tilt>
          &lt;range>990.6761201087104&lt;/range>
        &lt;/LookAt>
        &lt;Placemark>
          &lt;name>Absolute&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transBluePoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>absolute&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3372510731295,36.14888505105317,1784
                  -112.3356128688403,36.14781540589019,1784
                  -112.3368169371048,36.14658677734382,1784
                  -112.3384408457543,36.14762778914076,1784
                  -112.3372510731295,36.14888505105317,1784 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Absolute Extruded&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transRedPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>absolute&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3396586818843,36.14637618647505,1784
                  -112.3380597654315,36.14531751871353,1784
                  -112.3368254237788,36.14659596244607,1784
                  -112.3384555043203,36.14762621763982,1784
                  -112.3396586818843,36.14637618647505,1784 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Relative&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;LookAt>
            &lt;longitude>-112.3350152490417&lt;/longitude>
            &lt;latitude>36.14943123077423&lt;/latitude>
            &lt;altitude>0&lt;/altitude>
            &lt;heading>-118.9214100848499&lt;/heading>
            &lt;tilt>37.92486261093203&lt;/tilt>
            &lt;range>345.5169113679813&lt;/range>
          &lt;/LookAt>
          &lt;styleUrl>#transGreenPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3349463145932,36.14988705767721,100
                  -112.3354019540677,36.14941108398372,100
                  -112.3344428289146,36.14878490381308,100
                  -112.3331289492913,36.14780840132443,100
                  -112.3317019516947,36.14680755678357,100
                  -112.331131440106,36.1474173426228,100
                  -112.332616324338,36.14845453364654,100
                  -112.3339876620524,36.14926570522069,100
                  -112.3349463145932,36.14988705767721,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Relative Extruded&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;LookAt>
            &lt;longitude>-112.3351587892382&lt;/longitude>
            &lt;latitude>36.14979247129029&lt;/latitude>
            &lt;altitude>0&lt;/altitude>
            &lt;heading>-55.42811560891606&lt;/heading>
            &lt;tilt>56.10280503739589&lt;/tilt>
            &lt;range>401.0997279712519&lt;/range>
          &lt;/LookAt>
          &lt;styleUrl>#transYellowPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3348783983763,36.1514008468736,100
                  -112.3372535345629,36.14888517553886,100
                  -112.3356068927954,36.14781612679284,100
                  -112.3350034807972,36.14846469024177,100
                  -112.3358353861232,36.1489624162954,100
                  -112.3345888301373,36.15026229372507,100
                  -112.3337937856278,36.14978096026463,100
                  -112.3331798208424,36.1504472788618,100
                  -112.3348783983763,36.1514008468736,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
      &lt;/Folder>
    &lt;/Folder>
  &lt;/Document>
&lt;/kml>
</pre>



<p>The following is the resulting CSV after running the above code snippet (new CSV file: <code>'my_file.csv'</code>):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">36.10677870477137,-112.0814237830345,36.0905099328766,-112.0870267752693
36.10673460007995,-112.080622229595,36.09049598612422,-112.085242575315
36.09447672602546,-112.265654928602,36.09342608838671,-112.2660384528238,36.09251058776881,-112.2668139013453,36.09189827357996,-112.2677826834445,36.0913137941187,-112.2688557510952,36.0903677207521,-112.2694810717219,36.08932171487285,-112.2695268555611,36.08850916060472,-112.2690144567276,36.08753813597956,-112.2681528815339,36.08682685262568,-112.2670588176031,36.08646312301303,-112.2657374587321
36.07954952145647,-112.2550785337791,36.08117083492122,-112.2549277039738,36.08260761307279,-112.2552505069063,36.08395660588506,-112.2564540158376,36.08511401044813,-112.2580238976449,36.08584355239394,-112.2595218489022,36.08612634548589,-112.2608216347552,36.08626019085147,-112.262073428656,36.08621519860091,-112.2633204928495,36.08627897945274,-112.2644963846444,36.08649599090644,-112.2656969554589
36.09886943729116,-112.2532845153347,36.09919570465255,-112.2540466121145,36.09984998366178,-112.254734666947,36.10051310621746,-112.255493345654,36.10108441943419,-112.2563157098468,36.10159722088088,-112.2568033076439,36.10204323542867,-112.257494011321,36.10229131995655,-112.2584106072308,36.10240001286358,-112.2596588987972,36.10213176873407,-112.2610581199487,36.10157011437219,-112.2626285262793
36.09445214722695,-112.2656634181359,36.09520916122063,-112.2652238941097,36.09580763864907,-112.2645079986395,36.09628572284063,-112.2638827428817,36.09679275951239,-112.2635746835406,36.09740038871899,-112.2635711822407,36.09804913435539,-112.2640296531825,36.09880337400301,-112.264327720538,36.09963644790288,-112.2642436562271,36.10055381117246,-112.2639148687042,36.10149062823369,-112.2626894973474
37.42257124044786,-122.0848938459612,37.42211922626856,-122.0849580979198,37.42207183952619,-122.0847469573047,37.42209006729676,-122.0845725380962,37.42215932700895,-122.0845954886723,37.42227278564371,-122.0838521118269,37.42203539112084,-122.083792243335,37.42209006957106,-122.0835076656616,37.42200987395161,-122.0834709464152,37.4221046494946,-122.0831221085748,37.42226503990386,-122.0829247374572,37.42231242843094,-122.0829339169385,37.42225046087618,-122.0833837359737,37.42234159228745,-122.0833607854248,37.42237075460644,-122.0834204551642,37.42251292011001,-122.083659133885,37.42265873093781,-122.0839758438952,37.42265143972521,-122.0842374743331,37.4226514386435,-122.0845036949503,37.42261133916315,-122.0848020460801,37.42256395055121,-122.0847882750515,37.42257124044786,-122.0848938459612
37.42227033155257,-122.0857412771483,37.42231408832346,-122.0858169768481,37.42230337469744,-122.085852582875,37.42225686138789,-122.0858799945639,37.4222311076138,-122.0858860101409,37.42220250173855,-122.0858069157288,37.42214027058678,-122.0858379542653,37.42208690214408,-122.0856732640519,37.42214885429042,-122.0856022926407,37.422128290487,-122.0855902778436,37.42208171967246,-122.0855841672237,37.42210455874995,-122.0854852065741,37.42214267949824,-122.0855067264352,37.42212783846172,-122.0854430712915,37.42251282407603,-122.0850990714904,37.42281815323651,-122.0856769818632,37.42244918858722,-122.0860162273783,37.42229239604253,-122.0857260327004,37.42227033155257,-122.0857412771483
37.42136208886969,-122.0857862287242,37.42136935989481,-122.0857312990603,37.42140934910903,-122.0857312992918,37.42138390166565,-122.0856077073679,37.42137299550869,-122.0855802426516,37.42137299504316,-122.0852186221971,37.42161656508265,-122.0852277765639,37.42160565894403,-122.0852598189347,37.42168200156,-122.0852598185499,37.42170017860346,-122.0852369311478,37.42176197982575,-122.0852643957828,37.42176198013907,-122.0853239032746,37.421852864452,-122.0853559454324,37.42188921823734,-122.0854108752463,37.42189285337048,-122.0854795379357,37.42188921797546,-122.0855436229819,37.42186013499926,-122.0856260178042,37.42186013453605,-122.085937287963,37.42160898590042,-122.0859428718666,37.42157992759144,-122.0859655469861,37.42147115002957,-122.0858640462341,37.42140571326184,-122.0858548911215,37.4214057134039,-122.0858091162768,37.42136208886969,-122.0857862287242
37.42177253003091,-122.0844371128284,37.42191111542896,-122.0845118855746,37.42178755121535,-122.0850470999805,37.42143663023161,-122.0850719913391,37.42137237822116,-122.084916406232,37.42137237801626,-122.0842193868167,37.42147617161496,-122.08421938659,37.4214613409357,-122.0838086419991,37.42131306410796,-122.0837899728564,37.42129328840593,-122.0832796534698,37.42139213944298,-122.0832609819207,37.42137236399876,-122.0829373621737,37.42151569778871,-122.0829062425667,37.42176282576465,-122.0828502269665,37.42176776969635,-122.0829435788635,37.42179248552686,-122.083217411188,37.4217480074456,-122.0835970430103,37.42169364237603,-122.0839455556771,37.42176283815853,-122.0840077894637,37.42174801104392,-122.084113587521,37.42171341292375,-122.0840762473784,37.42167881534569,-122.0841447047739,37.42181720660197,-122.084144704223,37.4218170700446,-122.0842503333074,37.42177253003091,-122.0844371128284
38.87253259892824,-77.05788457660967,38.87291016281703,-77.05465973756702,38.87053267794386,-77.05315536854791,38.868757801256,-77.05552622493516,38.86996206506943,-77.05844056290393,38.87253259892824,-77.05788457660967
38.87154239798456,-77.05668055019126,38.87167890344077,-77.05542625960818,38.87076535397792,-77.05485125901024,38.87008686581446,-77.05577677433152,38.87054446963351,-77.05691162017543,38.87154239798456,-77.05668055019126
36.14888505105317,-112.3372510731295,36.14781540589019,-112.3356128688403,36.14658677734382,-112.3368169371048,36.14762778914076,-112.3384408457543,36.14888505105317,-112.3372510731295
36.14637618647505,-112.3396586818843,36.14531751871353,-112.3380597654315,36.14659596244607,-112.3368254237788,36.14762621763982,-112.3384555043203,36.14637618647505,-112.3396586818843
36.14988705767721,-112.3349463145932,36.14941108398372,-112.3354019540677,36.14878490381308,-112.3344428289146,36.14780840132443,-112.3331289492913,36.14680755678357,-112.3317019516947,36.1474173426228,-112.331131440106,36.14845453364654,-112.332616324338,36.14926570522069,-112.3339876620524,36.14988705767721,-112.3349463145932
36.1514008468736,-112.3348783983763,36.14888517553886,-112.3372535345629,36.14781612679284,-112.3356068927954,36.14846469024177,-112.3350034807972,36.1489624162954,-112.3358353861232,36.15026229372507,-112.3345888301373,36.14978096026463,-112.3337937856278,36.1504472788618,-112.3331798208424,36.1514008468736,-112.3348783983763
</pre>



<h2 class="wp-block-heading">How to Convert KMZ to CSV in Python?</h2>



<p>Files in the KML format are often packaged and distributed as KMZ files with the suffix <code>.kmz</code>. </p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2139.png" alt="ℹ" class="wp-smiley" style="height: 1em; max-height: 1em;" /> KMZ files are zipped KML files with a special content structure: a single root KML document named <code>doc.kml</code>. Any additional files, such as images, icons, and 3D models, are also stored in the zip archive.</p>



<p class="has-global-color-8-background-color has-background">To convert a KMZ file to a CSV, you can unzip it and convert the root KML file to a <code>.csv</code> file in Python by using the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/" data-type="post" data-id="474965" target="_blank">BeautifulSoup</a> and the <code>csv</code> libraries. You use the former to read the XML-structured KML file and the latter to write the CSV file row by row. </p>



<p>The remaining (non-KML) contents of the zip archive, such as images, cannot meaningfully be converted to a CSV anyway.</p>



<p>See the code above for the KML to CSV conversion.</p>
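<p>The whole KMZ workflow can be condensed into a small helper. This is a minimal sketch, not the article&#8217;s original code: the function name <code>kmz_to_csv</code> and the file names are hypothetical placeholders, and it assumes the root document inside the KMZ is named <code>doc.kml</code>, as is conventional.</p>

```python
# Sketch: unzip a KMZ, parse the root doc.kml with BeautifulSoup,
# and write one CSV row of latitude,longitude pairs per <coordinates> tag.
# The function name kmz_to_csv and all file names are placeholders.
import csv
import zipfile

from bs4 import BeautifulSoup


def kmz_to_csv(kmz_path, csv_path):
    with zipfile.ZipFile(kmz_path) as kmz:
        # By convention, the root KML document inside a KMZ is 'doc.kml'
        kml_data = kmz.read('doc.kml')

    soup = BeautifulSoup(kml_data, 'xml')

    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for coords in soup.find_all('coordinates'):
            row = []
            # Each whitespace-separated entry is 'longitude,latitude,altitude';
            # swap to latitude,longitude order for the CSV
            for triple in coords.text.split():
                lon, lat, _alt = triple.split(',')
                row.extend([lat, lon])
            writer.writerow(row)
```

<p>Calling, say, <code>kmz_to_csv('my_places.kmz', 'my_file.csv')</code> would then produce the same kind of CSV shown above.</p>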



<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p>The post <a href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/">Python &#8211; How to Convert KML to CSV?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV</title>
		<link>https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/</link>
		
		<dc:creator><![CDATA[Jordan Marshall]]></dc:creator>
		<pubDate>Sat, 16 Jul 2022 15:04:30 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[XML]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=474965</guid>

					<description><![CDATA[<p>Though Python’s BeautifulSoup module was designed to scrape HTML files, it can also be used to parse XML files. In today’s professional marketplace, it is useful to be able to change an XML file into other formats, specifically dictionaries, CSV, JSON, and dataframes according to specific needs. In this article, we will discuss that process. ... <a title="Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV" class="read-more" href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/" aria-label="Read more about Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/">Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><em>Though Python’s BeautifulSoup module was designed to scrape HTML files, it can also be used to parse XML files. </em></p>



<p><em>In today’s professional marketplace, it is useful to be able to change an XML file into other formats, specifically dictionaries, CSV, JSON, and dataframes according to specific needs. </em></p>



<p><em>In this article, we will discuss that process.</em></p>



<h2 class="wp-block-heading">Scraping XML with BeautifulSoup</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Extensible Markup Language</strong> or <strong><a rel="noreferrer noopener" href="https://blog.finxter.com/xml-developer-income-and-opportunity/" data-type="post" data-id="242005" target="_blank">XML</a></strong> differs from <a rel="noreferrer noopener" href="https://blog.finxter.com/html-developer-income-and-opportunity/" data-type="post" data-id="191232" target="_blank">HTML</a> in that HTML primarily deals with how information is displayed on a webpage, and XML handles how data is stored and transmitted. XML also uses custom tags and is designed to be user and machine-readable. </p>



<p>When inspecting a webpage, a statement at the top of the page will denote what type of file you are viewing. </p>



<p>For an XML file, you may see <code>&lt;?xml version="1.0"?&gt;</code>. </p>



<p class="has-base-background-color has-background">As a side note, &#8220;<code>version 1.0</code>&#8221; is a little deceiving: several modifications have been made since its inception in 1998; the name has just not changed. </p>



<p>Despite the differences between HTML and XML, because <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank">BeautifulSoup</a> creates a <strong>Python object tree</strong>, it can be used to parse both, and the process is similar for each. For this article, I will be using a sample XML file from <a rel="noreferrer noopener" href="https://www.w3schools.com/xml/cd_catalog.xml" data-type="URL" data-id="https://www.w3schools.com/xml/cd_catalog.xml" target="_blank">w3schools.com</a>.</p>



<p>Import the BeautifulSoup library and requests modules to scrape this file.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Import needed libraries
from pprint import pprint
from bs4 import BeautifulSoup
import requests</pre>



<p>Once these have been imported, request the content of the webpage.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Request data
webpage = requests.get("https://www.w3schools.com/xml/cd_catalog.xml")
data = webpage.content
pprint(data)</pre>



<p>At this point, I like to print the data just to make sure I am getting what I need. I use the <code>pprint()</code> function to make the output more readable.</p>



<p>Next, create a BeautifulSoup object and declare the parser to be used. Because it is an XML file, use an XML parser.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Create a BeautifulSoup object
soup = BeautifulSoup(data, 'xml')
print(soup.prettify())</pre>



<p>With that printed, you can see the object tree created by BeautifulSoup. The parent, “<code>&lt;CATALOG&gt;</code>”, its child “<code>&lt;CD&gt;</code>”, and all of the children of “<code>CD</code>” are displayed.</p>



<p><strong>Output of the first CD:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;CATALOG>
&lt;CD>
&lt;TITLE>Empire Burlesque&lt;/TITLE>
&lt;ARTIST>Bob Dylan&lt;/ARTIST>
&lt;COUNTRY>USA&lt;/COUNTRY>
&lt;COMPANY>Columbia&lt;/COMPANY>
&lt;PRICE>10.90&lt;/PRICE>
&lt;YEAR>1985&lt;/YEAR>
&lt;/CD></pre>



<p>All that is left is to scrape the desired data and display it. </p>



<p>Using the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-enumerate/" data-type="post" data-id="20466" target="_blank"><code>enumerate()</code></a> and <code><a rel="noreferrer noopener" href="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" data-type="post" data-id="17772" target="_blank">find_all()</a></code> functions, each occurrence of a tag can be found, and its contents can be placed into a <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>. </p>



<p>After that, using a <code>for</code> loop, <a rel="noreferrer noopener" href="https://blog.finxter.com/python-unpacking/" data-type="post" data-id="396420" target="_blank">unpack</a> the created lists and create groupings. The <code>.text</code> attribute and the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-string-strip/" data-type="post" data-id="26104" target="_blank">strip()</a></code> function give only the text and <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-remove-extra-whitespaces-in-beautifulsoup/" data-type="post" data-id="223870" target="_blank">remove the white space</a>. </p>



<p>Just for readability, <a href="https://blog.finxter.com/how-to-skip-a-line-in-python-using-n/" data-type="post" data-id="451007" target="_blank" rel="noreferrer noopener">print a blank line</a> after each grouping.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Scrape data
parent = soup.find('CATALOG')
for n, tag in enumerate(parent.find_all('CD')):
    title = [x for x in tag.find_all('TITLE')]
    artist = [x for x in tag.find_all('ARTIST')]
    country = [x for x in tag.find_all('COUNTRY')]
    company = [x for x in tag.find_all('COMPANY')]
    price = [x for x in tag.find_all('PRICE')]
    year = [x for x in tag.find_all('YEAR')]
    # view data
    for item in title:
        print('Title: ', item.text.strip())
    for item in artist:
        print('Artist: ', item.text.strip())
    for item in country:
        print('Country: ', item.text.strip())
    for item in company:
        print('Company: ', item.text.strip())
    for item in price:
        print('Price: ', item.text.strip())
    for item in year:
        print('Year: ', item.text.strip())
    print()</pre>



<p>With that, the CDs should be cataloged in this format.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Title:  Empire Burlesque
Artist:  Bob Dylan
Country:  USA
Company:  Columbia
Price:  10.90
Year:  1985 </pre>



<h2 class="wp-block-heading">XML to Dictionary</h2>



<p>Besides lists, <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionaries</a> are a common structure for storing data in Python. </p>



<p>Information is stored in key: value pairs. Those pairs are stored within curly <code>{}</code> brackets. </p>



<p class="has-base-background-color has-background"><strong>Example</strong>: <code>capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}</code></p>



<p>The key of the pair is case-sensitive and unique. The value can be any data type and may be duplicated. </p>



<p>Accessing the value of a pair is done via its key. Since keys cannot be duplicated, finding a value even in a large dictionary is easy as long as you know the key. A list of keys can be obtained using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-dict-keys-method/" data-type="post" data-id="37711" target="_blank">keys()</a></code> method. </p>



<p class="has-base-background-color has-background"><strong>Example</strong>: <code>print(capital.keys())</code></p>



<p>Finding information in a dictionary is quick since you only search for a specific key. </p>
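<p>As a quick, self-contained illustration of these operations, using the <code>capital</code> dictionary from above:</p>

```python
# Create the example dictionary of state capitals
capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}

# Look up a value via its key (fast even for large dictionaries)
print(capital['Pennsylvania'])   # Harrisburg

# Obtain a list of all keys (insertion order is preserved in Python 3.7+)
print(list(capital.keys()))      # ['Pennsylvania', 'Michigan']
```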



<p>Because of this quick access, dictionaries are used quite often when memory usage is not a concern. It is therefore important to know how to convert information from an XML file into a dictionary. </p>



<p class="has-global-color-8-background-color has-background">There are six basic steps to convert an XML to a dictionary:</p>



<ol class="has-global-color-8-background-color has-background wp-block-list"><li><code>import xmltodict</code></li><li><code>import pprint</code></li><li><code>with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:</code><ul><li><code>cd_xml = file.read()</code></li></ul></li><li><code>cd_dict = xmltodict.parse(cd_xml)</code></li><li><code>cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]</code></li><li><code>pprint.pprint(cd_dict_list)</code></li></ol>



<p>First, for the conversion, use the third-party <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-xmltodict-in-python/" data-type="post" data-id="457149" target="_blank">xmltodict</a></code> library (install it with <code>pip install xmltodict</code> if needed). Import that module and any other modules to be used.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import xmltodict
import pprint</pre>



<p>Second, the file needs to be opened, read, and assigned to a variable.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:
    cd_xml = file.read()</pre>



<p>Third, use <code>xmltodict.parse()</code> to convert the XML string to a dictionary and view it.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cd_dict = xmltodict.parse(cd_xml)
cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]
pprint.pprint(cd_dict_list)</pre>



<p>The output of this is a nice clean <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-create-a-list-of-dictionaries-in-python/" data-type="post" data-id="10576" target="_blank">list of dictionaries</a>. To view all artists, a simple <code>for</code> loop can be used.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for item in cd_dict_list:
    print(item['ARTIST'])</pre>



<h2 class="wp-block-heading">XML to JSON</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>JSON</strong> stands for <strong>JavaScript Object Notation</strong>. These files store data in <code>key:value</code> form like a Python dictionary. JSON files are used primarily to transmit data between web applications and servers. </p>



<p>Converting an XML file to a JSON file requires only a few lines of code.&nbsp;</p>



<p>As always, import the needed libraries and modules.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json
from pprint import pprint
import xmltodict</pre>



<p>Again, you will see the use of <code>xmltodict</code>. Because of their similarities, first convert the file to a dictionary, then write it to a JSON file. The <code>json.dumps()</code> function serializes the parsed dictionary into a JSON string, which is then written to a JSON file.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog example.xml') as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

json_data = json.dumps(data_dict)
with open('data.json', 'w') as json_file:
    json_file.write(json_data)</pre>



<p><strong>Output</strong>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">('{"CATALOG": {"CD": [{"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan", '
 '"COUNTRY": "USA", "COMPANY": "Columbia", "PRICE": "10.90", "YEAR": "1985"}, '
 '{"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler", "COUNTRY": "UK", '
 '"COMPANY": "CBS Records", "PRICE": "9.90", "YEAR": "1988"}, {"TITLE": '
 '"Greatest Hits", "ARTIST": "Dolly Parton", "COUNTRY": "USA", "COMPANY": '
 '"RCA", "PRICE": "9.90", "YEAR": "1982"}, {"TITLE": "Still got the blues", '….)
</pre>



<p>The data that started as an XML file has now been written to a JSON file called <code>data.json</code>.&nbsp;</p>
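

<p>The same write-then-read round trip can be sketched with only the standard library. Here a small dictionary stands in for the <code>xmltodict</code> output, and the file is written to a temporary directory for illustration:</p>

```python
import json
import os
import tempfile

# A small dictionary standing in for the parsed XML catalog.
data_dict = {'CATALOG': {'CD': [{'TITLE': 'Empire Burlesque',
                                 'ARTIST': 'Bob Dylan'}]}}

# Write the dictionary out as JSON...
path = os.path.join(tempfile.gettempdir(), 'data.json')
with open(path, 'w') as json_file:
    json.dump(data_dict, json_file)

# ...then load it back in to confirm nothing was lost.
with open(path) as json_file:
    loaded = json.load(json_file)

print(loaded == data_dict)  # True
```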



<h2 class="wp-block-heading">XML to DataFrame</h2>



<p>There are a couple of ways to achieve this goal. </p>



<p>Python&#8217;s built-in <code>ElementTree</code> module is one. I am, however, partial to <a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank">Pandas</a>. </p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Pandas</strong> is a great module for working with data, and it simplifies many daily tasks of a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-developer-income-and-opportunity/" data-type="post" data-id="189354" target="_blank">programmer</a> and <a rel="noreferrer noopener" href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478" target="_blank">data scientist</a>. I strongly suggest <a href="https://blog.finxter.com/pandas-cheat-sheets/" data-type="post" data-id="7977" target="_blank" rel="noreferrer noopener">becoming familiar</a> with this module. </p>



<p>For this code, use a combination of BeautifulSoup and Pandas.</p>



<p>Import the necessary libraries.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
from bs4 import BeautifulSoup</pre>



<p>To display the output fully, Pandas&#8217; display settings may need to be altered. I am going to set the maximum number of columns as well as the display width. This will override any default settings that may be in place. </p>



<p>Without doing this, you may find some of your columns are replaced by ‘<code>…</code>’ or the columns may be displayed under your first couple of columns.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># set max columns and display width
pd.set_option("display.max_columns", 10)
pd.set_option("display.width", 1000)</pre>



<p>The width and columns can be changed according to your needs. With that completed, <a href="https://blog.finxter.com/python-open-function/" data-type="post" data-id="24793" target="_blank" rel="noreferrer noopener">open</a> and read the XML file. Store the contents in a variable.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r') as xml_file:
    contents = xml_file.read()</pre>



<p>Next, create a BeautifulSoup object.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># BeautifulSoup object
soup = BeautifulSoup(contents, 'xml')</pre>



<p>The next step is to extract the data and assign it to a variable.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Extract data and assign it to a variable
title = soup.find_all("TITLE")
artist = soup.find_all("ARTIST")
country = soup.find_all("COUNTRY")
company = soup.find_all("COMPANY")
price = soup.find_all("PRICE")
year = soup.find_all("YEAR")</pre>



<p>Now a <code>for</code> loop can be used to extract the text. </p>



<p>Using the length of one of the variables means you never need to remember how many items are cataloged, even if data is added or removed later. </p>



<p>Place the text in an <a href="https://blog.finxter.com/how-to-create-an-empty-list-in-python/" data-type="post" data-id="453870" target="_blank" rel="noreferrer noopener">empty list</a>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Text
cd_info = []
for i in range(len(title)):
    rows = [title[i].get_text(),
            artist[i].get_text(),
            country[i].get_text(),
            company[i].get_text(),
            price[i].get_text(),
            year[i].get_text()]
    cd_info.append(rows)</pre>



<p>Lastly, create the data frame and name the columns.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Create a dataframe with Pandas and print
df = pd.DataFrame(cd_info, columns=['Title', 'Artist', 'Country', 'Company', 'Price', 'Year'])
print(df)</pre>



<p><strong>Output</strong></p>



<pre class="wp-block-preformatted"><code>            Title                  Artist              Country         Company      Price     Year
0           Empire Burlesque       Bob Dylan           USA             Columbia     10.90     1985
1           Hide your heart        Bonnie Tyler        UK              CBS Records  9.90      1988
2           Greatest Hits          Dolly Parton        USA             RCA          9.90      1982</code></pre>



<p>A nice, neat table containing each CD’s data has been created.</p>



<h2 class="wp-block-heading">XML to CSV</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> A CSV file, or comma-separated values file, stores tabular data as plain text easily readable by the user. It is commonly used to exchange data between applications and can be opened by any editor. </p>



<p>For example, Microsoft Excel. Each line represents a new row of data, and each comma starts a new column. Using the code from above, the XML data can be converted to a CSV file with one new line.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df.to_csv('cd catalog.csv')</pre>



<p>With that, open your file manager and navigate to the script&#8217;s working directory to find <code>'cd catalog.csv'</code>. It will open in the default program used for spreadsheets, in this case Microsoft Excel.</p>
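

<p>As a sketch of the same round trip without touching the disk: when <code>to_csv()</code> is given no path, it returns the CSV text as a string, which <code>read_csv()</code> can parse straight back. The two sample rows are taken from the catalog above:</p>

```python
import io

import pandas as pd

df = pd.DataFrame([['Empire Burlesque', 'Bob Dylan', 'USA'],
                   ['Hide your heart', 'Bonnie Tyler', 'UK']],
                  columns=['Title', 'Artist', 'Country'])

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv() accepts any file-like object, so the string round-trips cleanly.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back.equals(df))  # True
```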



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><td>Title</td><td>Artist&nbsp;</td><td>Country</td><td>Company</td><td>Price</td><td>&nbsp;Year</td></tr><tr><td>Empire Burlesque</td><td>Bob Dylan</td><td>USA</td><td>Columbia</td><td>10.90</td><td>1985</td></tr><tr><td>Hide your heart</td><td>Bonnie Tyler</td><td>UK</td><td>CBS Records</td><td>9.90</td><td>1988</td></tr><tr><td>Greatest Hits</td><td>Dolly Parton</td><td>USA</td><td>RCA</td><td>9.90</td><td>1982</td></tr><tr><td>Still got the blues</td><td>Gary Moore</td><td>UK</td><td>Virgin records</td><td>10.20</td><td>1990</td></tr><tr><td>Eros</td><td>Eros Ramazzotti</td><td>EU</td><td>BMG</td><td>9.90</td><td>1997</td></tr><tr><td>One night only</td><td>Bee Gees</td><td>UK</td><td>Polydor</td><td>10.90</td><td>1998</td></tr><tr><td>Sylvias Mother</td><td>Dr.Hook</td><td>UK</td><td>CBS</td><td>8.10</td><td>1973</td></tr><tr><td>Maggie May</td><td>Rod Stewart</td><td>UK</td><td>Pickwick</td><td>8.50</td><td>1990</td></tr><tr><td>Romanza</td><td>Andrea Bocelli</td><td>EU</td><td>Polydor</td><td>10.80</td><td>1996</td></tr></tbody></table></figure>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Related Tutorial</strong>: <a rel="noreferrer noopener" href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/" data-type="URL" data-id="https://blog.finxter.com/python-how-to-convert-kml-to-csv/" target="_blank">How to Convert a KML to a CSV File in Python?</a></p>
<p>The post <a href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/">Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Scrape a Bookstore in 5 Steps Python [Learn Project]</title>
		<link>https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 14 Jun 2022 21:23:53 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=422300</guid>

					<description><![CDATA[<p>Story: This series of articles assume you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure. 💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and ... <a title="Scrape a Bookstore in 5 Steps Python [Learn Project]" class="read-more" href="https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/" aria-label="Read more about Scrape a Bookstore in 5 Steps Python [Learn Project]">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/">Scrape a Bookstore in 5 Steps Python [Learn Project]</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Story</strong>: This series of articles assume you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.</em></p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: Before continuing, we recommend you possess, at minimum, a basic knowledge of <a rel="noreferrer noopener" href="https://www.w3schools.com/html/" target="_blank">HTML</a> and <a rel="noreferrer noopener" href="https://www.w3schools.com/css/default.asp" target="_blank">CSS</a> and have reviewed our articles on <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-scrape-html-tables-part-1/" target="_blank">How to Scrape HTML tables</a>.</p>



<h2 class="wp-block-heading">What You&#8217;ll Build in This Project</h2>



<p>Let&#8217;s navigate to <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" data-type="URL" data-id="https://books.toscrape.com/index.html" target="_blank">Books to Scrape </a>and review the format. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="564" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png" alt="" class="wp-image-224055" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-300x165.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-768x423.png 768w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a.png 1247w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>At first glance, you will notice:</p>



<ul class="wp-block-list"><li>Book categories display on the left-hand side.</li><li>There are, in total, 1,000 books listed on the website.</li><li>Each web page shows 20 Books.</li><li>Each price is in £ (in this instance, the UK pound).</li><li>Each Book displays <strong>minimum </strong>details.</li><li>To view <strong>complete </strong>details for a book, click on the image or the <code>Book Title</code> hyperlink. This hyperlink forwards to a page containing additional book details for the selected item (see below).</li><li>The total number of website pages displays in the footer (<code>Page 1 of 50</code>).</li></ul>



<h2 class="wp-embed-aspect-16-9 wp-has-aspect-ratio wp-block-heading" id="getting-started">Step 1: Install and Import Libraries for Project</h2>



<p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">Before any data manipulation can occur, three (3) new libraries will require installation.</p>



<ul class="wp-block-list"><li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="URL" data-id="https://blog.finxter.com/pandas-quickstart/" target="_blank">Pandas</a></em> library enables access to/from a <em>DataFrame</em>.</li><li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-requests-tutorials/" data-type="URL" data-id="https://blog.finxter.com/best-python-requests-tutorials/" target="_blank">Requests</a> </em>library provides access to the HTTP requests in Python.</li><li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="URL" data-id="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank">Beautiful Soup </a>library enables data extraction from HTML and XML files.</li></ul>



<p>To install these libraries, navigate to an <a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-ide/" data-type="post" data-id="8106" target="_blank">IDE</a> terminal. At the command prompt, execute the code below. The prompt used in this example is a dollar sign (<code>$</code>); your terminal prompt may differ.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install pandas</pre>



<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install requests</pre>



<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install beautifulsoup4</pre>



<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>



<p>If the installations were successful, a message displays in the terminal indicating the same.</p>



<hr class="wp-block-separator has-css-opacity"/>



<p>Feel free to view the PyCharm installation guides for the required libraries.</p>



<ul class="wp-block-list"><li><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-pandas-in-python/" target="_blank"></a><a href="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install Pandas on PyCharm</a></li><li><a href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-requests-in-python/" target="_blank" rel="noreferrer noopener">How to install Requests on PyCharm</a></li><li><a href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install BeautifulSoup4 on PyCharm</a></li></ul>



<hr class="wp-block-separator has-css-opacity"/>



<p>Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.</p>



<pre class="EnlighterJSRAW wp-embed-aspect-16-9 wp-has-aspect-ratio" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer</pre>



<ul class="wp-block-list"><li>The <code>time</code> library is built into Python and does not require installation. It contains <a rel="noreferrer noopener" href="https://blog.finxter.com/time-delay-in-python/" data-type="URL" data-id="https://blog.finxter.com/time-delay-in-python/" target="_blank"><code>time.sleep()</code></a>, which is used to set a delay between page scrapes.</li><li>The <code>urllib</code> library is built into Python and does not require installation. It contains <code>urllib.request</code>, which is used to save images.</li><li>The <code>csv</code> library is built into Python and does not require installation. It contains the <code>reader</code> and <code>writer</code> classes used to save data to a CSV file.</li></ul>



<h2 class="wp-block-heading">Step 2: Understand Basics and Scrape Your First Results</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="909" height="462" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png" alt="" class="wp-image-224220" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png 909w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-300x152.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-768x390.png 768w" sizes="auto, (max-width: 909px) 100vw, 909px" /></figure>
</div>


<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list" id="block-990dfa6f-f2e6-423a-84d3-3fbfcb432a12"><li>Reviewing the website to scrape.</li><li>Understanding HTTP Status Codes.</li><li>Connecting to the <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" target="_blank">Books to Scrape</a> website using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-library/" target="_blank">requests</a> </code>library.</li><li>Retrieving&nbsp;Total Pages to Scrape</li><li>Closing the Open Connection.</li></ul>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-1/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-1/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Step 3: Configure URL to Scrape and Avoid Spamming the Server</h2>



<div class="wp-block-cover aligncenter is-light"><span aria-hidden="true" class="wp-block-cover__background has-background-dim"></span><img loading="lazy" decoding="async" width="886" height="672" class="wp-block-cover__image-background wp-image-422310" alt="" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png" data-object-fit="cover" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png 886w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-300x228.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-768x583.png 768w" sizes="auto, (max-width: 886px) 100vw, 886px" /><div class="wp-block-cover__inner-container is-layout-flow wp-block-cover-is-layout-flow">
<p class="has-text-align-center has-base-3-color has-text-color has-large-font-size"><strong>Rule: Don&#8217;t Spam the Server!</strong></p>
</div></div>



<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list" id="block-30f20a4a-690b-43a9-bf02-27dbdcbfb3a7"><li>Configuring a page URL for scraping</li><li>Setting a delay: <a href="https://blog.finxter.com/time-delay-in-python/"><code>time.sleep()</code> </a>to pause between page scrapes.</li><li><a href="https://blog.finxter.com/python-loops/" target="_blank" rel="noreferrer noopener">Looping</a> through two (2) pages for testing purposes.</li></ul>
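

<p>The tasks above can be sketched as follows. The <code>page-{}.html</code> pattern matches the Books to Scrape pagination; the two-page limit and one-second delay are illustrative values you would tune for a real run:</p>

```python
import time

# Paginated URL template for Books to Scrape.
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
pages_to_scrape = 2  # limit the loop while testing

urls = []
for page in range(1, pages_to_scrape + 1):
    urls.append(base_url.format(page))
    time.sleep(1)  # pause between iterations so the server is not spammed

print(urls)
```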



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-2/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-2/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Step 4: Save Book Details in a Python List</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png" alt="" class="wp-image-422311" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-768x532.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123.png 1268w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list"><li>Locating Book details.</li><li>Writing code to retrieve this information for all Books.</li><li>Saving <code>Book</code> details to a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener">List</a>.</li></ul>
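

<p>A minimal sketch of locating book details with BeautifulSoup. The inline HTML below mimics one product entry from the Books to Scrape listing page (simplified for illustration; the full tutorial works on the live page):</p>

```python
from bs4 import BeautifulSoup

# A simplified stand-in for one product entry on the listing page.
html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="a-light-in-the-attic.html">A Light in...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')

books = []
for article in soup.find_all('article', class_='product_pod'):
    title = article.find('a')['title']  # full title lives in the link's title attribute
    price = article.find('p', class_='price_color').get_text()
    books.append([title, price])

print(books)  # [['A Light in the Attic', '£51.77']]
```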



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-3/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-3/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Step 5: Clean and Save the Scraped Output</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="340" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png" alt="" class="wp-image-422312" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-300x100.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-768x255.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124.png 1030w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list"><li>Cleaning up the scraped code.</li><li>Saving the output to a <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank">CSV </a>file.</li></ul>
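

<p>The saving step can be sketched with the <code>csv</code> module from the imports above. The file name and the two cleaned rows are illustrative:</p>

```python
import os
import tempfile
from csv import reader, writer

# Hypothetical cleaned rows ready for export.
header = ['Title', 'Price']
rows = [['A Light in the Attic', '51.77'],
        ['Tipping the Velvet', '53.74']]

path = os.path.join(tempfile.gettempdir(), 'books.csv')
with open(path, 'w', newline='') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(header)   # column names first
    csv_writer.writerows(rows)    # then one row per book

# Read the file back to confirm the contents.
with open(path, newline='') as csv_file:
    print(list(reader(csv_file)))
```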



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-4/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-4/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This tutorial has guided you through the steps to create your first practical web scraping project: scraping the contents of a book store! </p>



<p>Now, go out and use your skills wisely and to the benefit of humanity, my friend! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p></p>
<p>The post <a href="https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/">Scrape a Bookstore in 5 Steps Python [Learn Project]</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
