<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BeautifulSoup Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/category/beautifulsoup/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/category/beautifulsoup/</link>
	<description></description>
	<lastBuildDate>Thu, 19 Oct 2023 08:56:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>BeautifulSoup Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/category/beautifulsoup/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</title>
		<link>https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Thu, 19 Oct 2023 08:56:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652308</guid>

					<description><![CDATA[<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from MindBodyOnline.com or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot! 🕷️ Web scraping, a technique used to extract data from ... <a title="How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com" class="read-more" href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/" aria-label="Read more about How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from <a href="http://MindBodyOnline.com">MindBodyOnline.com</a> or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="598" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png" alt="" class="wp-image-1652328" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-300x175.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-768x449.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1536x897.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156.png 1585w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f577.png" alt="🕷" class="wp-smiley" style="height: 1em; max-height: 1em;" /></strong> <strong>Web scraping</strong>, a technique used to extract data from websites, has become an essential skill on Upwork &#8212; it&#8217;s one of the most sought-after skills on most <a href="https://blog.finxter.com/best-python-freelancer-platforms/">freelancing platforms</a>. Most beginners start with the <strong><a href="https://blog.finxter.com/installing-beautiful-soup/">Beautiful Soup</a></strong> and <strong><a href="https://blog.finxter.com/python-requests-library-2/">Requests</a></strong> modules in Python. While these tools are powerful, they&#8217;re not always sufficient for every site. Enter tools like <strong><a href="https://blog.finxter.com/how-to-open-a-url-in-python-selenium/">Selenium</a></strong>, which, while powerful, can sometimes be overkill or inefficient. </p>



<p>So, where should one start? The answer is simple: Always check for an API first.</p>



<h3 class="wp-block-heading">Why Start with APIs?</h3>



<p>An <strong>Application Programming Interface (API)</strong> allows two software applications to communicate with each other. Many websites offer APIs to provide structured access to their data, making it easier and more efficient than scraping the web pages directly.</p>



<p>Benefits of using APIs:</p>



<ul class="wp-block-list">
<li><strong>Efficiency</strong>: Extracting data from APIs is often faster and less resource-intensive than scraping web pages.</li>



<li><strong>Reliability</strong>: APIs are designed to be accessed programmatically, reducing the chances of breaking changes.</li>



<li><strong>Ethical considerations</strong>: Accessing data via an API is often more in line with a website&#8217;s terms of service than scraping their pages directly.</li>
</ul>



<p>MindBodyOnline provides a dedicated API tailored for developers: <a href="https://developers.mindbodyonline.com/ui/documentation/public-api#/http/mindbody-public-api-v6-0/introduction/getting-started">MindBody API</a>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="536" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png" alt="" class="wp-image-1652310" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-300x157.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-768x402.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140.png 1426w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>If you&#8217;re aiming to craft an app utilizing their dataset, this API is your ideal resource. It boasts a plethora of endpoints, enabling swift data retrieval and ensuring seamless interaction between your application and their servers.</p>



<p><strong>But what if you aren’t creating an application and just need to scrape data once for research?</strong> MindBodyOnline also retrieves the data for its own website via an API. JavaScript on the page requests the data needed to populate the site, and we can make requests to that same API ourselves.</p>
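<p>As a sketch of what that looks like in code (assuming the <code>requests</code> module; the endpoint URL is the one that shows up in the browser's network tab later in this article, and the payload values here are purely illustrative), we can build such a request without even sending it:</p>

```python
import requests

# Sketch: construct (but don't send) the kind of request the site's own
# JavaScript makes. The endpoint comes from the browser's network tab;
# the coordinate values are illustrative.
payload = '{"page": {"size": 50, "number": 1}, "filter": {"latitude": 40.71, "longitude": -74.01}}'

req = requests.Request(
    method="GET",
    url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
    data=payload,
    headers={"content-type": "application/json"},
)
prepared = req.prepare()
print(prepared.method, prepared.url)
```

Calling <code>requests.Session().send(prepared)</code> would actually dispatch it; preparing first is a handy way to inspect exactly what will go over the wire.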



<h2 class="wp-block-heading">How to check if a website is rendered with Javascript</h2>



<p>The site we will be scraping is <a href="https://www.mindbodyonline.com/explore">MindBodyOnline</a>. </p>



<p>If a website is rendered with <a href="https://blog.finxter.com/javascript-data-types/">JavaScript</a>, we should check the network traffic and see if we can find a request that returns the data we see on the page. This can be done quickly with developer tools. In Chrome, you can bring up developer tools by pressing <code>Ctrl+Shift+I</code>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png" alt="" class="wp-image-1652313" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-768x531.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1536x1063.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143.png 1620w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>From here, we can turn off JavaScript, then refresh the page and see if there are any changes. To turn off JavaScript, first press <code>Ctrl+Shift+P</code> to bring up the command palette. Start typing “javascript” to filter the options, then click “Disable JavaScript”.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="506" height="117" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png" alt="" class="wp-image-1652311" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png 506w, https://blog.finxter.com/wp-content/uploads/2023/10/image-141-300x69.png 300w" sizes="auto, (max-width: 506px) 100vw, 506px" /></figure>
</div>


<p>Then refresh the page. As we can see, they use JavaScript for all the data.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="444" height="95" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png" alt="" class="wp-image-1652312" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png 444w, https://blog.finxter.com/wp-content/uploads/2023/10/image-142-300x64.png 300w" sizes="auto, (max-width: 444px) 100vw, 444px" /></figure>
</div>


<p>Before we can continue, we need to turn JavaScript back on. Bring up the command palette again, filter for “javascript”, and click “Enable JavaScript”. Then refresh the page again.</p>



<h2 class="wp-block-heading">Check the JavaScript Requests</h2>



<p>Select the Network tab in developer tools.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png" alt="" class="wp-image-1652314" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-144-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>Make sure <code>Fetch/XHR</code> and <code>Preserve log</code> are selected. Next, we can click the circle with the line through it to clear the output. Then perform a search to see what requests were performed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="192" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png" alt="" class="wp-image-1652315" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-145-300x92.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can then check each item in the output to see if it returns useful information.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="601" height="255" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png" alt="" class="wp-image-1652316" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png 601w, https://blog.finxter.com/wp-content/uploads/2023/10/image-146-300x127.png 300w" sizes="auto, (max-width: 601px) 100vw, 601px" /></figure>
</div>


<p>We are primarily interested in the response to the request. We are looking for JSON data that matches the data shown on the page. In this case, it is the <code>locations</code> request that contains the data we seek.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="395" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png" alt="" class="wp-image-1652317" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-147-300x190.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can also see that there is a payload required. When we make our requests, we must provide this payload in the request body. There are three items of interest here. The latitude and longitude allow us to control the city we are pulling data for, and we also need to provide a page number.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="562" height="111" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png" alt="" class="wp-image-1652318" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png 562w, https://blog.finxter.com/wp-content/uploads/2023/10/image-148-300x59.png 300w" sizes="auto, (max-width: 562px) 100vw, 562px" /></figure>
</div>


<p>MindBody uses pagination, so a relatively small amount of data is pulled with each request. A large city like New York can have over a hundred pages.</p>



<p>We go to the headers tab to copy the request URL.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="120" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png" alt="" class="wp-image-1652319" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-149-300x58.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<h2 class="wp-block-heading">Using Insomnia to Generate Request Headers</h2>



<p>From here, we can use a tool to help us with the request syntax. </p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Insomnia</strong> is a powerful open-source API client tool for testing and debugging APIs. It provides a user-friendly interface to send requests to web services and view responses. With Insomnia, you can define various request types, from simple HTTP GET requests to complex JSON, GraphQL, or even multipart file uploads. You can download the insomnia desktop app <a href="https://insomnia.rest/download">here</a>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="601" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png" alt="" class="wp-image-1652320" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-300x176.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-768x451.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150.png 1342w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Using Insomnia is quite simple. Just paste in the API URL and click <code>Send</code>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png" alt="" class="wp-image-1652321" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-151-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can check the preview tab to make sure it returns the data we want:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="493" height="506" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png" alt="" class="wp-image-1652322" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png 493w, https://blog.finxter.com/wp-content/uploads/2023/10/image-152-292x300.png 292w" sizes="auto, (max-width: 493px) 100vw, 493px" /></figure>
</div>


<p>This is where it gets good. If we click the dropdown on the Send button, one of the options is “Generate client code”. How convenient! Select Python as the language and Requests as the library, then click “Copy to Clipboard” and you’re off to the races.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="397" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png" alt="" class="wp-image-1652323" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-153-300x191.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<h2 class="wp-block-heading">A Simple Scrapy Spider</h2>



<p>The code can be found on <a href="https://github.com/PythonCB/Scrape_MindBodyOnline">Github</a>. I will walk through the code below, starting with the imports.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import scrapy
import json
import pandas as pd
from scrapy.crawler import CrawlerProcess
import os
</pre>



<p><a href="https://blog.finxter.com/python-scrapy-scraping-dynamic-website-with-api-generated-content/">Scrapy</a> is a good option because it can handle multiple requests at the same time with <a href="https://blog.finxter.com/python-async-for-mastering-asynchronous-iteration-in-python/">asynchronous</a> processing. Scapy has a lot of bells and whistles and a fair bit of a learning curve, but it’s also possible to avoid a lot of the extra complexity. The goal here was to place all the code in one simple script.</p>



<p>First, we have to create a spider class. The class is pretty large so I’ll display it in chunks.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">class MindbodySpider(scrapy.Spider):
    name = 'mindbody_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 5,
        'DOWNLOAD_DELAY': 3.2,
    }
</pre>



<p>Our class inherits from one of the Scrapy <code>Spider</code> classes, with <code>scrapy.Spider</code> being the simplest. In the custom settings, with <code>CONCURRENT_REQUESTS</code> set to <code>5</code>, Scrapy will process five requests at a time, starting a new one as soon as one finishes. </p>



<p>We use a <code>DOWNLOAD_DELAY</code> so we don’t bombard the website with too many requests at once.</p>



<p>Next, we need a starting template for the payload:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">starting_payload = '''{
                          "sort":"-_score,distance",
                          "page":{"size":50,"number":&lt;&lt;num>>},
                          "filter":{"categories":"any",
                                    "latitude":&lt;&lt;lat>>,
                                    "longitude":&lt;&lt;lon>>,
                                    "categoryTypes":"any"}
                       }'''
</pre>
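<p>As a quick sanity check (a standalone sketch, separate from the spider), we can fill the placeholders with <code>str.replace()</code> and confirm the result parses as valid JSON:</p>

```python
import json

# Same placeholder scheme as the spider's template; values here are illustrative.
starting_payload = '''{
    "sort": "-_score,distance",
    "page": {"size": 50, "number": <<pg>>},
    "filter": {"categories": "any",
               "latitude": <<lat>>,
               "longitude": <<lon>>,
               "categoryTypes": "any"}
}'''

payload = (starting_payload
           .replace('<<pg>>', '1')
           .replace('<<lat>>', '40.7128')
           .replace('<<lon>>', '-74.0060'))

parsed = json.loads(payload)  # raises ValueError if the template was filled incorrectly
print(parsed['page']['number'])  # → 1
```

Validating the filled template locally like this catches placeholder typos before any request is sent.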



<p>Next, we have the headers that Insomnia so helpfully provided for us.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">headers = {
        "cookie": "__cf_bm=zdIhLHXKd2OAveBChKORUMdydUFVzC2Ma51sQxv.UJ0-1694646164-0-Abmbwcj2wNw%2FpityY4DWRWy%2FftBkjTO0vQ3tZ0gwU0P5bsTqcasf2XZlBwL%2BUaevGaH%2BTDzZOJPBXbWYwgsXkJc%3D",
        "authority": "prod-mkt-gateway.mindbody.io",
        "accept": "application/vnd.api+json",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "origin": "https://www.mindbodyonline.com",
        "sec-ch-ua": "^\^Not/A",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "^\^Windows^^",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "cross-site",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "x-mb-app-build": "2023-08-02T13:33:44.200Z",
        "x-mb-app-name": "mindbody.io",
        "x-mb-app-version": "e5d1fad6",
        "x-mb-user-session-id": "oeu1688920580338r0.2065068094427127"
    }
</pre>



<p>Then comes a very simple <code>__init__</code> method:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def __init__(self):
        scrapy.Spider.__init__(self)
        self.city_count = 0
</pre>



<p>The <code>start_requests</code> method loops through each city. This is the main loop that creates the first request for each city.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def start_requests(self):
        cities = pd.read_csv('uscities.csv')

        for idx, city in cities.iterrows():
            lat, lon = city.lat, city.lng
            self.logger.info(f"{city.city}, {city.state_id} started")

            # Start with the first page for each city
            payload = self.starting_payload.replace('&lt;&lt;pg>>', '1').replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))

            yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': city.city, 'page_num': 1, 'lat': lat, 'lon': lon, 'state': city.state_id},
                callback=self.parse
            )
</pre>



<p>The code is pretty simple. We <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/">create a DataFrame</a> from a <a href="https://blog.finxter.com/read-a-csv-file-to-a-pandas-dataframe/">CSV file</a> with city information and then loop through it with the <code>iterrows</code> method. We create the payload for the request using the template and the lat/long values from the DataFrame. The page is set to 1 each time. We will handle additional pages later.</p>



<p>Finally, we yield a <code>scrapy.Request</code> object. We use <code><a href="https://blog.finxter.com/yield-keyword-in-python-a-simple-illustrated-guide/">yield</a></code> instead of <code><a href="https://blog.finxter.com/python-return/">return</a></code> so we can handle <a href="https://blog.finxter.com/python-async-requests-getting-urls-concurrently-via-https/">multiple requests concurrently</a>. The body is our modified payload, and we use the same header for each request.</p>
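<p>The difference is easy to see with a plain Python generator (a minimal illustration, unrelated to Scrapy itself):</p>

```python
def make_requests():
    # Each yield hands one item back to the caller; execution pauses here
    # and resumes only when the next item is requested.
    for page in range(1, 4):
        yield f"request for page {page}"

gen = make_requests()
print(next(gen))  # → request for page 1
print(next(gen))  # → request for page 2
```

Scrapy consumes <code>start_requests</code> the same way, pulling requests off the generator one at a time as download slots free up, rather than building the whole list upfront.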



<p>What do we do with the response returned from the request? As soon as the response is returned, it is fed into the <code>parse</code> method thanks to the <code>callback</code> parameter:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">callback=self.parse</pre>



<p>The <code>meta</code> parameter gives us a way to pass information to the <code>callback</code> function. We need the <code>page_num</code>, <code>lat</code>, and <code>lon</code> values for the next request. <code>city_name</code> and <code>state</code> are used for screen output.</p>



<p>The list of cities was downloaded from the web. Many different sources will work, as long as they contain latitude and longitude values.</p>
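<p>For illustration, a tiny stand-in for <code>uscities.csv</code> with the columns the spider expects (<code>city</code>, <code>state_id</code>, <code>lat</code>, <code>lng</code>) might look like this:</p>

```python
import pandas as pd

# Hypothetical two-row stand-in for uscities.csv.
cities = pd.DataFrame({
    'city': ['New York', 'Los Angeles'],
    'state_id': ['NY', 'CA'],
    'lat': [40.7128, 34.0522],
    'lng': [-74.0060, -118.2437],
})

# Same iteration pattern as start_requests above.
for idx, city in cities.iterrows():
    print(f"{city.city}, {city.state_id}: ({city.lat}, {city.lng})")
```

Any city dataset with these four columns (or renamed to them) will drop straight into the spider.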



<h2 class="wp-block-heading">Parsing the Response</h2>



<p>The <code>parse</code> method is a little long, but not too complicated. </p>



<p>Getting the data and saving it is very easy. We just convert <code>response.text</code> to a DataFrame and <a href="https://blog.finxter.com/how-to-export-pandas-dataframe-to-csv-example/">save it to a CSV file</a>. If the file already exists, we will append the data and not include a header. Otherwise, we create a new CSV file and include a header.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def parse(self, response):
        data = json.loads(response.text)
        gyms_df = pd.json_normalize(data['data'])

        # Save the dataframe to a CSV
        city_name = response.meta['city_name']
        state = response.meta['state']
        fname = f'{city_name} {state}.csv'.replace(' ', '_')
        csv_path = f'./data/cities2/{fname}'

        # Check if the file exists to determine the write mode
        file_exists = os.path.exists(csv_path)
        write_mode = 'a' if file_exists else 'w'

        gyms_df.to_csv(csv_path,
                       mode=write_mode,
                       index=False,
                       header=not file_exists)
</pre>
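<p>To see the flattening and append logic in isolation, here is a sketch with a made-up two-record response (the field names are hypothetical, not MindBody's actual schema):</p>

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical API response with nested records.
response_text = json.dumps({"data": [
    {"id": "1", "attributes": {"name": "Gym A", "rating": 4.8}},
    {"id": "2", "attributes": {"name": "Gym B", "rating": 4.5}},
]})

data = json.loads(response_text)
gyms_df = pd.json_normalize(data['data'])  # flattens to: id, attributes.name, attributes.rating

# Write the header only on the first save; later pages append below it.
csv_path = os.path.join(tempfile.mkdtemp(), 'New_York_NY.csv')
file_exists = os.path.exists(csv_path)
gyms_df.to_csv(csv_path,
               mode='a' if file_exists else 'w',
               index=False,
               header=not file_exists)
print(pd.read_csv(csv_path).shape)  # → (2, 3)
```

<code>json_normalize</code> turns each nested dictionary into dotted column names, which is why the scraped CSVs end up with columns like <code>attributes.name</code>.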



<h2 class="wp-block-heading">Handling Pagination</h2>



<p>To move on to the next page, we need to create another Scrapy Request. For the payload we use the same latitude and longitude and increment the page number by 1.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">        # Check if there's another page and if so, initiate the request
        next_page_num = response.meta['page_num'] + 1
        if next_page_num &lt;= 150:  # Optional: upper limit
            lat, lon = response.meta['lat'], response.meta['lon']  # Assuming you store lat and lon in meta too

            payload = self.starting_payload.replace('&lt;&lt;pg>>', '1').replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))
</pre>



<h2 class="wp-block-heading">Make the Request for the Next Page</h2>



<p>To finish the <code>parse</code> method, all we have to do is make another request with the new payload.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': response.meta['city_name'], 
                      'page_num': next_page_num, 
                      'lat': lat, 
                      'lon': lon,
                      'state': state},
                callback=self.parse
            )

        self.city_count += 1
        print(response.meta['city_name'], f'complete ({self.city_count})')
        self.logger.info(f"""{response.meta['city_name']}, 
                           {response.meta['state']} is complete""")
</pre>



<h2 class="wp-block-heading">How the Pagination Loop Terminates</h2>



<p>What happens if there are 100 pages for the current city and the code sends a request with <code>page_num = 101</code>? </p>



<p>The request will not return anything, so the callback function won’t get called and the recursive loop for that city will stop. </p>



<p>Then the <code>start_requests</code> loop will move on to the next city.</p>



<h2 class="wp-block-heading">It’s alive! Setting Our Little Spider Loose</h2>



<p>To get our creepy critter crawling, we create a <code>CrawlerProcess</code>. Then tell it to crawl. Then tell it to start. On your mark, get set, CRAWL!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">process = CrawlerProcess()
process.crawl(MindbodySpider)
process.start()
</pre>



<h2 class="wp-block-heading">Results</h2>



<p>I was able to scrape data for 16,000 cities in about half a week. I think I averaged about 100 cities an hour. The larger cities had over a hundred pages but there were <strong>thousands upon thousands of cities with 5-10 pages</strong>.</p>



<p>What about the data? It’s fairly extensive and could be very useful.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="518" height="788" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png" alt="" class="wp-image-1652324" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png 518w, https://blog.finxter.com/wp-content/uploads/2023/10/image-154-197x300.png 197w" sizes="auto, (max-width: 518px) 100vw, 518px" /></figure>
</div>


<p>Pretty good information related to services offered, location, amenities, total ratings, etc. Looking at the rest of the columns:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="510" height="396" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png" alt="" class="wp-image-1652325" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png 510w, https://blog.finxter.com/wp-content/uploads/2023/10/image-155-300x233.png 300w" sizes="auto, (max-width: 510px) 100vw, 510px" /></figure>
</div>


<h2 class="wp-block-heading">Conclusion</h2>



<p>Uncovering the API proved invaluable. It eliminated the need to craft path selectors for individual data elements, significantly streamlining the process. Moreover, it spared me from devising a Scrapy workaround for the JavaScript-rendered page. Investing time in learning Scrapy was a sound decision, given its superior speed compared to other methods I explored.</p>



<p>Looking ahead, the logical progression is to integrate the data into platforms like Jupyter Notebook, Power BI, or Tableau. Furthermore, storing the data in a database seems apt, especially considering the apparent one-to-many relationships observed in each city, like categories and subcategories.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to become a master web scraper, feel free to check out our academy course with downloadable PDF certificate to showcase your skills to future employers or freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="800" height="341" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png" alt="" class="wp-image-1652329" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png 800w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-300x128.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-768x327.png 768w" sizes="auto, (max-width: 800px) 100vw, 800px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Academy</strong>: <a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/">Web Scraping with BeautifulSoup</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path</title>
		<link>https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/</link>
		
		<dc:creator><![CDATA[Shubham Sayon]]></dc:creator>
		<pubDate>Thu, 28 Sep 2023 19:56:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=22845</guid>

					<description><![CDATA[<p>Summary: Use urllib.parse.urljoin() to scrape the base URL and the relative path and join them to extract the complete/absolute URL. You can also concatenate the base URL and the absolute path to derive the absolute path; but make sure to take care of erroneous situations like extra forward-slash in this case. Quick Answer When web ... <a title="Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path" class="read-more" href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/" aria-label="Read more about Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/">Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-global-color-8-background-color has-background"><strong>Summary: </strong>Use <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin" target="_blank" rel="noreferrer noopener"><code data-enlighter-language="generic" class="EnlighterJSRAW">urllib.parse.urljoin()</code></a> to join the scraped base URL and the relative path into the complete/<strong>absolute </strong>URL. You can also concatenate the base URL and the relative path manually to derive the absolute URL, but make sure to handle erroneous situations like an extra forward slash in this case.</p>



<h2 class="wp-block-heading">Quick Answer</h2>



<p>When web scraping with BeautifulSoup in Python, you may encounter relative URLs (e.g., <code>/page2.html</code>) instead of absolute URLs (e.g., <code>http://example.com/page2.html</code>). To convert relative URLs to absolute URLs, you can use the <code>urljoin()</code> function from the <code>urllib.parse</code> module.</p>



<p>Below is an example of how to extract absolute URLs from the <code>a</code> tags on a webpage using <code>BeautifulSoup</code> and <code>urljoin</code>:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="816" height="757" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-130.png" alt="" class="wp-image-1651860" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-130.png 816w, https://blog.finxter.com/wp-content/uploads/2023/09/image-130-300x278.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-130-768x712.png 768w" sizes="auto, (max-width: 816px) 100vw, 816px" /></figure>
</div>


<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

# URL of the webpage you want to scrape
url = 'http://example.com'

# Send an HTTP request to the URL
response = requests.get(url)
response.raise_for_status()  # Raise an error for bad responses

# Parse the webpage content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the 'a' tags on the webpage
for a_tag in soup.find_all('a'):
    # Get the href attribute from the 'a' tag
    href = a_tag.get('href')

    # Use urljoin to convert the relative URL to an absolute URL
    absolute_url = urljoin(url, href)

    # Print the absolute URL
    print(absolute_url)</pre>



<p>In this example:</p>



<ul class="wp-block-list">
<li><code>url</code> is the URL of the webpage you want to scrape.</li>



<li><code>response</code> is the HTTP response obtained by sending an HTTP GET request to the URL.</li>



<li><code>soup</code> is a <code>BeautifulSoup</code> object that contains the parsed HTML content of the webpage.</li>



<li><code>soup.find_all('a')</code> finds all the <code>a</code> tags on the webpage.</li>



<li><code>a_tag.get('href')</code> gets the <code>href</code> attribute from an <code>a</code> tag, which is the relative URL.</li>



<li><code>urljoin(url, href)</code> converts the relative URL to an absolute URL by joining it with the base URL.</li>



<li><code>absolute_url</code> is the absolute URL, which is printed to the console.</li>
</ul>



<p>Now that you have a quick overview, let&#8217;s dive into the specific problem more deeply and discuss various methods to solve this easily and effectively. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Problem Formulation</h2>



<p><strong>Problem: </strong>How do you extract all the absolute URLs from an HTML page?</p>



<p><strong>Example: </strong>Consider the following webpage which has numerous links:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="355" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-129-1024x355.png" alt="" class="wp-image-1651858" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-129-1024x355.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129-300x104.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129-768x266.png 768w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129.png 1266w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-1024x319.png" alt="" class="wp-image-22847" style="object-fit:contain;width:784px;height:243px" width="784" height="243" srcset="https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-1024x319.png 1024w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-300x93.png 300w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-768x239.png 768w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-150x47.png 150w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links.png 1305w" sizes="auto, (max-width: 784px) 100vw, 784px" /><figcaption class="wp-element-caption"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Link</strong>: <a href="https://sayonshubham.github.io/">https://sayonshubham.github.io/</a></figcaption></figure>
</div>


<p>Now, when you try to <a href="https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python">scrape</a> the links as highlighted above, you find that only the relative links/paths are extracted instead of the entire absolute path. Let us have a look at the code given below, which demonstrates what happens when you try to extract the <code>'href'</code> elements normally.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get(web_url, headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        print(url['href'])</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">/
/about
/blog
/finxter
/</pre>



<p>The above output is not what you desired. You wanted to extract the absolute paths as shown below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>



<p>Without further delay, let us go ahead and try to extract the absolute paths instead of the relative paths. </p>



<h2 class="wp-block-heading">Method 1: Using <span class="has-inline-color has-luminous-vivid-orange-color">urllib.parse.urljoin()</span></h2>



<p>The easiest solution to our problem is to use the <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin" target="_blank" rel="noreferrer noopener"><code>urllib.parse.urljoin()</code></a> method.</p>



<p>According to the Python documentation, <code data-enlighter-language="generic" class="EnlighterJSRAW">urllib.parse.urljoin()</code> constructs a full/absolute URL by combining a &#8220;base URL&#8221; with another URL. The advantage of using <code>urljoin()</code> is that it properly resolves the relative path, whether the base URL is just the domain or the absolute URL of a specific webpage.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from urllib.parse import urljoin

URL_1 = 'http://www.example.com'
URL_2 = 'http://www.example.com/something/index.html'

print(urljoin(URL_1, '/demo'))
print(urljoin(URL_2, '/demo'))</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">http://www.example.com/demo
http://www.example.com/demo</pre>



<p>Now that we have an idea about <code data-enlighter-language="generic" class="EnlighterJSRAW">urljoin</code>, let us have a look at the following code, which resolves our problem and extracts the complete/absolute paths from the HTML page.</p>



<p><strong>Solution:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get(web_url, headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        print(urljoin(web_url, url.get('href')))</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>



<h2 class="wp-block-heading">Method 2: Concatenate The Base URL And Relative URL Manually</h2>



<p>Another workaround to our problem is to concatenate the base part of the URL and the relative URLs manually, just like two ordinary strings. The problem is that manual concatenation can introduce subtle errors such as duplicated slashes. Try to spot the extra forward slash character <code>/</code> below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">URL_1 = 'http://www.example.com/'
print(URL_1+'/demo')

# Output --> http://www.example.com//demo</pre>



<p>Therefore, to concatenate correctly, you have to adjust your code so that any extra character that could cause errors is removed. The following code concatenates the base URL and the relative paths without introducing any extra forward slash.</p>



<p><strong><em>Solution:</em></strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}
# get() Request
response = requests.get(web_url, headers=headers)
# Store the webpage contents
webpage = response.content
# Check Status Code (Optional)
# print(response.status_code)
# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        # extract the href string
        x = url['href']
        # remove the extra forward-slash if present
        if x[0] == '/':       
            print(web_url + x[1:])
        else:
            print(web_url+x)</pre>



<p><strong>Output:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/26a0.png" alt="⚠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong><span style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-red-color">Caution:</span></strong> This is not the recommended way of extracting the absolute path from a given HTML page. If you write an automated script that must resolve URLs, but at the time of writing you don&#8217;t know which websites it will visit, this method won&#8217;t serve your purpose; your go-to method should be <code data-enlighter-language="generic" class="EnlighterJSRAW">urljoin</code>. Nevertheless, the method deserves a mention because, in our case, it serves the purpose and helps us extract the absolute URLs.</p>
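<p>To see why <code>urljoin()</code> is the safer default, here is a small sketch (the example hrefs are made up) of three cases that naive string concatenation handles incorrectly:</p>

```python
from urllib.parse import urljoin

base = 'https://sayonshubham.github.io/'

# urljoin() handles cases that naive string concatenation gets wrong:
print(urljoin(base, '/about'))                 # leading slash: no '//' doubling
print(urljoin(base, 'https://example.com/x'))  # absolute href: left untouched
print(urljoin(base, '../up'))                  # relative navigation is resolved
```

Manual concatenation would produce `https://sayonshubham.github.io//about`, mangle the already-absolute link, and leave the `../` unresolved.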



<h2 class="wp-block-heading">Conclusion</h2>



<p>In this article, we learned how to extract absolute links from a given HTML page using BeautifulSoup. If you want to master Python&#8217;s BeautifulSoup library and dive deep into its concepts with examples and video lessons, have a look at the following link and follow the articles one by one; you will find every aspect of BeautifulSoup explained in great detail.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Web Scraping With Beautiful Soup" width="937" height="527" src="https://www.youtube.com/embed/videoseries?list=PLbo6ydLr984ZbU9VrB1ouj9CCJ80x4Xmo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank" rel="noreferrer noopener">Web Scraping With BeautifulSoup In Python</a></p>



<p>With that, we come to the end of this tutorial! Please <strong><a href="http://blog.finxter.com/subscribe" target="_blank" rel="noreferrer noopener">stay tuned</a></strong> and <strong><a href="https://www.youtube.com/channel/UCRlWL2q80BnI4sA5ISrz9uw" target="_blank" rel="noreferrer noopener">subscribe</a></strong> for more interesting content in the future.</p>



<p>The post <a href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/">Python BS4 &#8211; How to Scrape Absolute URL Instead of Relative Path</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>3 Pythonic Ways to Download a PDF from a URL</title>
		<link>https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Thu, 20 Jul 2023 19:14:33 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1514249</guid>

					<description><![CDATA[<p>If you&#8217;re short on time, here&#8217;s the code for copy and paste: 👇 Let&#8217;s dive into the whole article, keep reading to learn and improve your skills (and enjoy the beautiful spider 🕷️🕸️ images I hand-picked for you)! 👇 💡 Quick overview: I&#8217;ll show you the three most Pythonic ways to download a PDF from ... <a title="3 Pythonic Ways to Download a PDF from a URL" class="read-more" href="https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/" aria-label="Read more about 3 Pythonic Ways to Download a PDF from a URL">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/">3 Pythonic Ways to Download a PDF from a URL</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>If you&#8217;re short on time, here&#8217;s the code for copy and paste: <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests

url = 'https://bitcoin.org/bitcoin.pdf'
response = requests.get(url)
response.raise_for_status()  # stop early instead of saving an error page

with open('sample.pdf', 'wb') as f:
    f.write(response.content)</pre>



<p>Let&#8217;s dive into the whole article. Keep reading to learn and improve your skills (and enjoy the beautiful spider <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f577.png" alt="🕷" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f578.png" alt="🕸" class="wp-smiley" style="height: 1em; max-height: 1em;" /> images I hand-picked for you)! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Quick overview</strong>: I&#8217;ll show you the three most Pythonic ways to download a PDF from a URL in Python: </p>



<ul class="wp-block-list">
<li><strong>Method 1</strong>: Use the <code>requests</code> library, a third-party library that allows you to send HTTP requests using Python. </li>



<li><strong>Method 2</strong>: Use the <code>urllib</code> module, a built-in Python library for handling URLs. </li>



<li><strong>Method 3</strong>: Use the popular BeautifulSoup library for web scraping. </li>
</ul>



<p>But first things first&#8230;</p>



<h2 class="wp-block-heading">Understanding the Basics</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="924" height="611" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-255.png" alt="" class="wp-image-1514271" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-255.png 924w, https://blog.finxter.com/wp-content/uploads/2023/07/image-255-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-255-768x508.png 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /></figure>
</div>


<p>To download PDFs from a URL in Python, one must first understand the basics of web scraping. <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank" rel="noreferrer noopener">Web scraping</a> is the process of extracting data from websites. It involves parsing HTML and other web page content to extract the desired information.</p>



<p><strong>Step 1: </strong>The first step in web scraping is to send an HTTP request to the URL of the web page you want to access. Once you have sent the request, you will receive an HTTP response from the server. This response will contain the HTML content of the web page.</p>



<p><strong>Step 2: </strong>To extract the PDF file link from the HTML content, use Python libraries such as Requests and BeautifulSoup. Requests is used for making HTTP requests to a website, while BeautifulSoup is used for parsing the HTML content of a web page.</p>



<p><strong>Step 3: </strong>Once you have parsed the HTML content and located the PDF file link, you can use the Requests library to download the PDF file. The Requests library provides a simple way to download files from the web. You can use its <code>get()</code> method to download the PDF file from the URL.</p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: Some websites may have restrictions on downloading PDF files. In such cases, you may need to provide additional headers to the HTTP request to bypass these restrictions.</p>
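<p>As a sketch of how such headers can be attached (the URL and the <code>User-Agent</code> value below are placeholders), you can prepare a request without sending it and inspect what would go over the wire:</p>

```python
import requests

# Some servers reject requests that lack a browser-like User-Agent header.
# Passing a headers dict to requests attaches it to the outgoing request.
url = 'https://example.com/report.pdf'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Prepare the request without sending it, to inspect the outgoing headers:
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)

# In practice, you would send it directly:
# response = requests.get(url, headers=headers)
# response.raise_for_status()
```
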



<p>In summary, to download a PDF file from a URL in Python, you need to:</p>



<ol class="wp-block-list">
<li>Send an HTTP request to the URL of the web page you want to access</li>



<li>Parse the HTML content of the web page using BeautifulSoup</li>



<li>Locate the PDF file link in the HTML content</li>



<li>Use the Requests library to download the PDF file from the URL</li>
</ol>
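<p>The steps above can be sketched as follows. To keep the example self-contained, steps 2 and 3 run on a tiny hand-written HTML string; in a real script, that string would come from <code>requests.get(page_url).text</code>, and the URL shown is a placeholder:</p>

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = 'https://example.com/reports/'  # placeholder page URL
html = '''
<html><body>
  <a href="about.html">About</a>
  <a href="annual_report.pdf">Annual report (PDF)</a>
</body></html>
'''

# Step 2: parse the HTML; Step 3: locate links that point at PDF files.
soup = BeautifulSoup(html, 'html.parser')
pdf_links = [
    urljoin(page_url, a['href'])
    for a in soup.find_all('a', href=True)
    if a['href'].lower().endswith('.pdf')
]
print(pdf_links)  # ['https://example.com/reports/annual_report.pdf']

# Step 4 would then be:
# response = requests.get(pdf_links[0])
# with open('annual_report.pdf', 'wb') as f:
#     f.write(response.content)
```
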



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/is-web-scraping-legal/" data-type="URL" data-id="https://blog.finxter.com/is-web-scraping-legal/" target="_blank" rel="noreferrer noopener">Is Web Scraping Legal?</a></p>



<h2 class="wp-block-heading">Method 1: Using the Requests Library</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="924" height="614" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-256.png" alt="" class="wp-image-1514272" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-256.png 924w, https://blog.finxter.com/wp-content/uploads/2023/07/image-256-300x199.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-256-768x510.png 768w" sizes="auto, (max-width: 924px) 100vw, 924px" /></figure>
</div>


<p>Python&#8217;s Requests library is a popular HTTP library that allows developers to send HTTP requests using Python. It is a simple and easy-to-use library that supports various HTTP methods, including GET, POST, PUT, DELETE, and more.</p>



<p>In this section, we will explore how to use the Requests library to download PDF files from a URL in Python.</p>



<h3 class="wp-block-heading">Setting Up Requests</h3>



<p>Before we can use the Requests library, we need to install it. We can install it using <code>pip</code>, which is a package manager for Python. To <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="post" data-id="35966" target="_blank">install <code>requests</code></a>, open a command prompt or terminal, and type the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install requests
</pre>



<p>Once installed, we can import the Requests library in our Python script using the following statement:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
</pre>



<h3 class="wp-block-heading">Downloading a PDF File</h3>



<p class="has-global-color-8-background-color has-background">To download a PDF file from a URL using the Requests library, we can use the <code><a href="https://blog.finxter.com/python-requests-get-the-ultimate-guide/" data-type="post" data-id="37837" target="_blank" rel="noreferrer noopener">get()</a></code> method, which sends an HTTP GET request to the specified URL and returns a response object. We can then use the <code>content</code> attribute of the response object to get the binary content of the PDF file.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="394" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-262-1024x394.png" alt="" class="wp-image-1514287" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-262-1024x394.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262-300x115.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262-768x296.png 768w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262-1536x591.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/07/image-262.png 1702w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Here&#8217;s an example code snippet that demonstrates how to download a PDF file using <code>requests</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests

url = 'https://bitcoin.org/bitcoin.pdf'
response = requests.get(url)

with open('sample.pdf', 'wb') as f:
    f.write(response.content)</pre>



<p>In this code snippet, we first import the Requests library. We then define the URL of the PDF file we want to download and use the <code>get()</code> method to send an HTTP GET request to the URL. The response object contains the binary content of the PDF file, which we can write to a file using the <code><a href="https://blog.finxter.com/python-open-function/" data-type="post" data-id="24793" target="_blank" rel="noreferrer noopener">open()</a></code> function.</p>



<p>We use the <code>'wb'</code> mode to open the file in binary mode, which allows us to write the binary content of the PDF file to disk with the <code>write()</code> method.</p>



<p>That&#8217;s it! We have successfully downloaded a PDF file from a URL using the <code>requests</code> library in Python.</p>
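<p>For large PDFs, you may prefer to stream the response instead of loading the whole file into memory at once. Here is a sketch using the same Bitcoin whitepaper URL (the 8&nbsp;KiB chunk size is an arbitrary, commonly used choice):</p>

```python
import requests

url = 'https://bitcoin.org/bitcoin.pdf'

# stream=True defers downloading the body; iter_content() yields it in
# chunks, so only one chunk is held in memory at a time.
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open('bitcoin.pdf', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
```
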



<h2 class="wp-block-heading">Method 2: Utilizing the Urllib Library</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="922" height="619" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-257.png" alt="" class="wp-image-1514273" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-257.png 922w, https://blog.finxter.com/wp-content/uploads/2023/07/image-257-300x201.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-257-768x516.png 768w" sizes="auto, (max-width: 922px) 100vw, 922px" /></figure>
</div>


<h3 class="wp-block-heading">Importing Urllib</h3>



<p>The <code>urllib</code> library is a built-in library in Python that allows developers to interact with URLs. Before using the <code>urllib</code> library, developers need to import it into their Python script. </p>



<p>To import the <code>urllib</code> library, developers can use the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import urllib.request
</pre>



<h3 class="wp-block-heading">Downloading a PDF with Urllib</h3>



<p class="has-global-color-8-background-color has-background">Once the <code>urllib</code> library is imported, you can use it to download PDFs from a URL. To download a PDF using <code>urllib</code>, use the <code>urlretrieve()</code> function, which takes two arguments: the URL of the PDF and the name of the file where the PDF will be saved. </p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import urllib.request

url = 'http://example.com/some_file.pdf'
filename = 'some_file.pdf'

urllib.request.urlretrieve(url, filename)
</pre>



<p>In this example, the <code>url</code> variable contains the URL of the PDF, and the <code>filename</code> variable contains the name of the file where the PDF will be saved. The <code>urlretrieve()</code> function downloads the PDF from the URL and saves it to the specified filename.</p>



<p>It&#8217;s important to note that the <code>urllib.request</code> module is specific to Python 3.x. In the long-unsupported Python 2.x, the same functionality was split across the <code>urllib</code> and <code>urllib2</code> libraries, and <code>urllib2</code> was commonly used to download files. </p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import urllib2

url = 'http://example.com/some_file.pdf'
filename = 'some_file.pdf'

response = urllib2.urlopen(url)
pdf = response.read()

with open(filename, 'wb') as f:
    f.write(pdf)
</pre>



<p>In this example, the <code>urllib2</code> library is used to download the PDF from the URL. The PDF is then saved to the specified filename using the <code>open()</code> function.</p>



<p>Overall, the <code>urllib</code> library is a useful tool for developers who need to download PDFs from URLs in their Python scripts. With the <code>urlretrieve()</code> function, developers can easily download PDFs and save them to a file.</p>



<h2 class="wp-block-heading">Method 3: Incorporating BeautifulSoup</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="927" height="710" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-258.png" alt="" class="wp-image-1514274" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-258.png 927w, https://blog.finxter.com/wp-content/uploads/2023/07/image-258-300x230.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-258-768x588.png 768w" sizes="auto, (max-width: 927px) 100vw, 927px" /></figure>
</div>


<h3 class="wp-block-heading">Integrating BeautifulSoup</h3>



<p>BeautifulSoup is a Python library that is widely used for web scraping purposes. It is a powerful tool for devs like you and me to extract information from HTML and XML documents. </p>



<p>When it comes to downloading PDFs from a website, BeautifulSoup can be used in conjunction with the <code>requests</code> library to extract links to PDF files from the HTML source code of a website.</p>



<p>To start using BeautifulSoup, import it into your Python environment and use the <code>BeautifulSoup()</code> constructor to create a BeautifulSoup object from the HTML source code of a website. Once you have a BeautifulSoup object, use its methods to extract information from the HTML source code.</p>



<h3 class="wp-block-heading">Extracting PDFs from HTML Source</h3>



<p>To extract PDF links from the HTML source code of a website, developers can use BeautifulSoup&#8217;s <code>find_all()</code> method to find all the <code>&lt;a></code> tags in the HTML source code. They can then loop through the <code>&lt;a></code> tags and check if the <code>href</code> attribute of each tag points to a PDF file.</p>



<p>If the <code>href</code> attribute of a tag points to a PDF file, use the <code>requests</code> library to download the PDF file. Use the <code>get()</code> method of the requests library to send an HTTP GET request to the URL of the PDF file. The response object returned by the <code>get()</code> method will contain the contents of the PDF file. Then use Python&#8217;s built-in <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-open-multiple-files-in-python/" data-type="post" data-id="403242" target="_blank">file handling</a> functions to save the contents of the PDF file to a local file.</p>
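<p>As a concrete sketch of the link-filtering step (the page URL and HTML below are hypothetical, used only to illustrate the technique):</p>

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML source of a page that links to some PDFs
html = """
<html><body>
  <a href="/files/report.pdf">Report</a>
  <a href="https://example.com/guide.pdf">Guide</a>
  <a href="/about.html">About</a>
</body></html>
"""

page_url = "https://example.com/downloads"
soup = BeautifulSoup(html, "html.parser")

# Keep every <a> tag whose href points at a PDF, resolving relative links
pdf_links = [
    urljoin(page_url, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".pdf")
]
print(pdf_links)
```

<p>Each URL collected this way can then be fetched with <code>requests.get()</code> and written to disk in binary mode, as in the earlier examples.</p>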



<h2 class="wp-block-heading">Handling Errors and Exceptions</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="915" height="512" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-259.png" alt="" class="wp-image-1514276" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-259.png 915w, https://blog.finxter.com/wp-content/uploads/2023/07/image-259-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-259-768x430.png 768w" sizes="auto, (max-width: 915px) 100vw, 915px" /></figure>
</div>


<h3 class="wp-block-heading">Anticipating Common Errors</h3>



<p>When downloading PDF files from URLs using Python, it is essential to anticipate common errors that may occur and prepare for them. </p>



<p>One common error is when the URL is invalid or the PDF file does not exist. </p>



<p>In such cases, the program may crash, and you won&#8217;t receive any feedback. </p>



<p>Another error may occur when you don&#8217;t have the necessary permissions to access the PDF file.</p>



<p>To anticipate such errors, one can use the<a rel="noreferrer noopener" href="https://blog.finxter.com/exploring-pythons-os-module/" data-type="post" data-id="19050" target="_blank"> <code>os</code> module</a> to check whether a file with the same name already exists locally before downloading. Additionally, one can check the response status code to confirm that the request succeeded: if the status code is not 200, the request failed and no PDF was downloaded.</p>
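<p>A minimal sketch of these checks, using a hypothetical <code>download_pdf()</code> helper of our own naming:</p>

```python
import os

import requests

def download_pdf(url, filename):
    """Download url to filename; return True only if a new file was written."""
    if os.path.exists(filename):
        return False  # avoid overwriting a file that was already downloaded
    response = requests.get(url)
    if response.status_code != 200:
        return False  # the request failed, so there is no PDF to save
    with open(filename, "wb") as f:
        f.write(response.content)
    return True
```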



<h3 class="wp-block-heading">Implementing Error Handling Functions</h3>



<p>When errors occur, handling them gracefully and providing feedback to the user is essential. One way to do this is by implementing error handling functions that catch the errors and provide feedback to the user.</p>



<p>One can use the <a href="https://blog.finxter.com/python-try-except-an-illustrated-guide/" data-type="post" data-id="367118" target="_blank" rel="noreferrer noopener"><code>try</code> and <code>except</code> statements</a> to catch errors and handle them gracefully. For example, when downloading PDF files, one can catch exceptions such as <code>requests.exceptions.RequestException</code> and <code>IOError</code> and provide feedback to the user.</p>
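<p>A hedged sketch of such a wrapper (the function name is our own) could look like this:</p>

```python
import requests

def safe_download(url, filename):
    """Download url to filename, reporting failures instead of crashing."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        with open(filename, "wb") as f:
            f.write(response.content)
    except requests.exceptions.RequestException as e:
        print(f"Download failed: {e}")
    except IOError as e:
        print(f"Could not write {filename}: {e}")
```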



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Python Try Except: An Illustrated Guide" width="937" height="527" src="https://www.youtube.com/embed/s-e0qL0FH9I?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Another way to handle errors is by checking HTTP status codes. For example, if the user does not have the necessary permissions to access the PDF file, the server will respond with a status code such as 403, which indicates that access to the file is forbidden; the program can report this code to the user.</p>



<h2 class="wp-block-heading">Organizing Downloaded PDFs</h2>



<p>After downloading PDF files using Python, organizing them properly for easy access and management is important. This section will cover how to create a directory to store downloaded PDFs and how to save the PDFs to that directory.</p>



<h3 class="wp-block-heading">Creating a Directory</h3>



<p>To create a directory to store downloaded PDFs, Python&#8217;s <code>os</code> module can be used. The <code>os</code> module provides a way to interact with the file system and create directories.</p>



<p>Here is an example code snippet that creates a directory called &#8220;PDFs&#8221; in the current working directory:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os

directory = "PDFs"
if not os.path.exists(directory):
    os.makedirs(directory)
</pre>



<p>This code checks if a directory named &#8220;PDFs&#8221; already exists in the current working directory. If it doesn&#8217;t exist, it creates the directory using the <code>os.makedirs()</code> function.</p>



<h3 class="wp-block-heading">Saving PDFs to a Directory</h3>



<p>Once a directory has been created to store downloaded PDFs, the next step is to save the PDFs to that directory.</p>



<p>Here is an example code snippet that downloads a sample PDF file and saves it to the &#8220;<code>PDFs</code>&#8221; directory:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import requests

url = "https://example.com/sample.pdf"
response = requests.get(url)

filename = "sample.pdf"
filepath = os.path.join("PDFs", filename)

with open(filepath, "wb") as f:
    f.write(response.content)
</pre>



<p>This code downloads a sample PDF file from the URL provided and saves it to a file named &#8220;<code>sample.pdf</code>&#8221; in the &#8220;<code>PDFs</code>&#8221; directory. The <code>os.path.join()</code> function is used to create the full path to the file by joining the directory name and filename together.</p>



<h2 class="wp-block-heading">Frequently Asked Questions</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="618" height="925" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-260.png" alt="" class="wp-image-1514277" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-260.png 618w, https://blog.finxter.com/wp-content/uploads/2023/07/image-260-200x300.png 200w" sizes="auto, (max-width: 618px) 100vw, 618px" /></figure>
</div>


<h3 class="wp-block-heading">How can I download a PDF file from a URL using Python?</h3>



<p>There are several ways to download a PDF file from a URL using Python. One of the most popular ways is to use the requests module. This module allows you to send HTTP requests using Python, which can be used to download files from a URL. You can also use the <code>urllib</code> module to download files from a URL.</p>



<h3 class="wp-block-heading">What is the best way to download a PDF file from a website using Python?</h3>



<p>The best way to download a PDF file from a website using Python depends on the specific website and the structure of the website. However, using the <code>requests</code> module is a popular method to download files from a website. You can also use the <code>urllib</code> module to download files from a website.</p>



<h3 class="wp-block-heading">How do I save a PDF file in Python after downloading it from a URL?</h3>



<p>After downloading a PDF file from a URL using Python, you can save it to a directory by using the <code>open()</code> function and the <code>write()</code> method. You will need to specify the file name and the directory where you want to save the file.</p>



<h3 class="wp-block-heading">What is the easiest way to download a PDF file using requests in Python?</h3>



<p>The easiest way to download a PDF file using requests in Python is to use the <code>get()</code> method of the requests module to fetch the URL of the file, then write the response content to a file in the directory where you want to save it.</p>



<h3 class="wp-block-heading">How can I scrape a PDF file from a website using BeautifulSoup in Python?</h3>



<p>You can scrape a PDF file from a website using BeautifulSoup in Python by first finding the URL of the PDF file on the website. Once you have the URL, you can use the <code>requests</code> module to download the file and then save it to a directory using the <code>open()</code> function and the <code>write()</code> method.</p>



<h3 class="wp-block-heading">What is the most efficient way to download a file from a URL and save it to a directory using Python?</h3>



<p>For most purposes, the <code>requests</code> module is the most efficient option. For large files, pass <code>stream=True</code> to <code>requests.get()</code> and write the response to disk in chunks with <code>iter_content()</code>, so the entire file never has to be held in memory. The <code>urllib</code> module remains a dependency-free alternative.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://blog.finxter.com/wp-content/uploads/2023/07/image-261-1024x574.png" alt="" class="wp-image-1514278" srcset="https://blog.finxter.com/wp-content/uploads/2023/07/image-261-1024x574.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/07/image-261-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/07/image-261-768x431.png 768w, https://blog.finxter.com/wp-content/uploads/2023/07/image-261.png 1239w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/5-easy-ways-to-download-an-image-from-a-url-in-python/" data-type="URL" data-id="https://blog.finxter.com/5-easy-ways-to-download-an-image-from-a-url-in-python/" target="_blank" rel="noreferrer noopener">5 Easy Ways to Download an Image from a URL in Python</a></p>
<p>The post <a href="https://blog.finxter.com/3-pythonic-ways-to-download-a-pdf-from-a-url/">3 Pythonic Ways to Download a PDF from a URL</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR</title>
		<link>https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/</link>
		
		<dc:creator><![CDATA[Emily Rosemary Collins]]></dc:creator>
		<pubDate>Tue, 13 Jun 2023 13:52:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1434923</guid>

					<description><![CDATA[<p>The Securities and Exchange Commission&#8217;s (SEC) Electronic Data Gathering, Analysis, and Retrieval system, known as EDGAR, serves as a rich source of information. This comprehensive database houses financial reports and statements that companies are legally required to disclose, such as a quarterly report filed by institutional investment managers. However, when attempting to extract data from ... <a title="Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR" class="read-more" href="https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/" aria-label="Read more about Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/">Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The <a rel="noreferrer noopener" href="https://www.sec.gov/" data-type="URL" data-id="https://www.sec.gov/" target="_blank">Securities and Exchange Commission&#8217;s (SEC)</a> Electronic Data Gathering, Analysis, and Retrieval system, known as EDGAR, serves as a rich source of information. This comprehensive database houses financial reports and statements that companies are legally required to disclose, such as a quarterly report filed by institutional investment managers.</p>



<p>However, when attempting to extract data from EDGAR via web scraping, you might encounter a stumbling block: an HTTPError that reads,<strong> &#8220;HTTP Error 403: Forbidden.&#8221;</strong> </p>



<p>This is a common issue faced by many data enthusiasts and researchers trying to access data programmatically from the EDGAR database.</p>



<h2 class="wp-block-heading">Understanding the Error</h2>



<p>HTTP Error 403, often termed a <strong>&#8216;Forbidden&#8217;</strong> error, is an HTTP status code signifying that the server understood the request but refuses to authorize it. This doesn&#8217;t necessarily mean the requester did something wrong; rather, it implies that access to the requested resource is forbidden for some reason.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="530" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-119-1024x530.png" alt="" class="wp-image-1434996" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-119-1024x530.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119-300x155.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119-768x397.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119-1536x795.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/06/image-119.png 1741w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><em><strong>Screenshot</strong>: Accessing the page may work in the browser but not in your Python code.</em></figcaption></figure>
</div>


<p>When you encounter an HTTP 403 error while accessing the EDGAR 13F filings, it means the EDGAR server has denied your request to download the data. This is typically because the request appears to be from a script or a bot rather than a human using a web browser.</p>



<h2 class="wp-block-heading">Bypassing the Error</h2>



<p class="has-global-color-8-background-color has-background">One common workaround for the 403 error is to <strong>modify the HTTP request&#8217;s user-agent header</strong> to imitate a web browser. Web servers use the user-agent header to identify the client making the request and can sometimes restrict access based on this information.</p>



<p>Here is a Python example using the <code>requests</code> library:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="4" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests

url = 'https://www.sec.gov/Archives/edgar/data/.../' # Put your target URL here
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
</pre>



<p>In this example, we set the User-Agent to mimic a common web browser, effectively tricking the server into treating the script as a regular user.</p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f469-200d-1f4bb.png" alt="👩‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/python-requests-library/" data-type="URL" data-id="https://blog.finxter.com/python-requests-library/" target="_blank" rel="noreferrer noopener">Python Requests Library – Your First HTTP Request in Python</a></p>



<h2 class="wp-block-heading">Caution and Consideration</h2>



<p>While this technique may help bypass the 403 error, it&#8217;s crucial to emphasize that it should be used responsibly. The SEC might have legitimate reasons for preventing certain types of access to their system. Overuse or misuse of this workaround might lead to IP blocking or other consequences.</p>



<p>Moreover, remember that it&#8217;s important to respect the terms of service of the website you&#8217;re accessing and adhere to any rate limits or access restrictions. Before you use scraping techniques, it&#8217;s advisable to review the SEC&#8217;s EDGAR access rules and usage guidelines.</p>
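<p>In practice, polite access can be as simple as declaring who you are and pacing your requests. The sketch below is illustrative only; the name and email are placeholders you should replace with your own contact details:</p>

```python
import time

import requests

# Automated clients are expected to identify themselves; the name and
# email here are placeholders -- substitute your own contact details.
headers = {"User-Agent": "Jane Doe jane.doe@example.com"}

filing_urls = []  # fill with the EDGAR filing URLs you need

for url in filing_urls:
    response = requests.get(url, headers=headers)
    # ... process response.text here ...
    time.sleep(0.2)  # pause between requests to keep the request rate modest
```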



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f469-200d-1f4bb.png" alt="👩‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/is-web-scraping-legal/" data-type="post" data-id="383048" target="_blank" rel="noreferrer noopener">Is Web Scraping Legal?</a></p>



<p></p>
<p>The post <a href="https://blog.finxter.com/solving-response-403-http-forbidden-error-scraping-sec-edgar/">Solving Response [403] HTTP Forbidden Error: Scraping SEC EDGAR</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python Web Scraping: From URL to CSV in No Time</title>
		<link>https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Sun, 23 Apr 2023 18:35:54 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[Data Conversion]]></category>
		<category><![CDATA[File Handling]]></category>
		<category><![CDATA[Pandas Library]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1313474</guid>

					<description><![CDATA[<p>Setting up the Environment Before diving into web scraping with Python, set up your environment by installing the necessary libraries. First, install the following libraries: requests, BeautifulSoup, and pandas. These packages play a crucial role in web scraping, each serving different purposes.✨ To install these libraries, click on the previously provided links for a full ... <a title="Python Web Scraping: From URL to CSV in No Time" class="read-more" href="https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/" aria-label="Read more about Python Web Scraping: From URL to CSV in No Time">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/">Python Web Scraping: From URL to CSV in No Time</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Setting up the Environment</h2>



<p>Before diving into web scraping with Python, set up your environment by installing the necessary libraries.</p>



<p>First, install the following libraries: <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="post" data-id="35966" target="_blank">requests</a></code>, <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-beautifulsoup4-in-python/" data-type="post" data-id="457056" target="_blank">BeautifulSoup</a></code>, and <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-pandas-in-python/" data-type="post" data-id="35926" target="_blank">pandas</a></code>. These packages play a crucial role in web scraping, each serving different purposes.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>To install these libraries, click on the previously provided links for a full guide (including troubleshooting) or simply run the following commands:</p>



<pre class="wp-block-preformatted"><code>pip install requests
pip install beautifulsoup4
pip install pandas</code>
</pre>



<p>The <code>requests</code> library will be used to make HTTP requests to websites and download the HTML content. It simplifies the process of fetching web content in Python.</p>



<p><code>BeautifulSoup</code> is a fantastic library that helps extract data from the HTML content fetched from websites. It makes navigating, searching, and modifying HTML easy, making web scraping straightforward and convenient.</p>



<p><code>Pandas</code> will be helpful in data manipulation and organizing the scraped data into a CSV file. It provides powerful tools for working with structured data, making it popular among data scientists and web scraping enthusiasts. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f43c.png" alt="🐼" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
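<p>For example, once scraped records are collected in memory, pandas can write them straight to a CSV file (the rows below are made-up sample data):</p>

```python
import pandas as pd

# Made-up sample records standing in for scraped data
rows = [
    {"quote": "To be or not to be", "author": "Shakespeare"},
    {"quote": "I think, therefore I am", "author": "Descartes"},
]

df = pd.DataFrame(rows)
df.to_csv("quotes.csv", index=False)  # one CSV row per scraped record
print(df.shape)
```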



<h2 class="wp-block-heading">Fetching and Parsing URL</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="495" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-230.png" alt="" class="wp-image-1313580" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-230.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-230-300x200.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>Next, you&#8217;ll learn how to fetch and parse URLs using Python to <strong>scrape data and save it as a CSV file</strong>. We will cover sending HTTP requests, handling errors, and utilizing libraries to make the process efficient and smooth. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Sending HTTP Requests</h3>



<p>When fetching content from a URL, Python offers a powerful library known as the <code>requests</code> library. It allows users to send HTTP requests, such as GET or POST, to a specific URL, obtain a response, and parse it for information. </p>



<p>We will use the <code>requests</code> library to help us fetch data from our desired URL. </p>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
response = requests.get('https://example.com/data.csv')</pre>



<p>The variable <code>response</code> will store the server&#8217;s response, including the data we want to scrape. From here, we can access the content using <code>response.content</code>, which will return the raw data in <a href="https://blog.finxter.com/python-bytes-vs-bytearray/" data-type="post" data-id="870390" target="_blank" rel="noreferrer noopener">bytes</a> format. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f310.png" alt="🌐" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Handling HTTP Errors</h3>



<p>Handling HTTP errors while fetching data from URLs ensures a smooth experience and prevents unexpected issues. The <code>requests</code> library makes error handling easy by providing methods to check whether the request was successful. </p>



<p>Here&#8217;s a simple example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
response = requests.get('https://example.com/data.csv')
response.raise_for_status()</pre>



<p>The <code>raise_for_status()</code> method will raise an exception if there&#8217;s an HTTP error, such as a 404 Not Found or 500 Internal Server Error. This helps us ensure that our script doesn&#8217;t continue to process erroneous data, allowing us to gracefully handle any issues that may arise. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>With these tools, you are now better equipped to fetch and parse URLs using Python. This will enable you to <strong>effectively scrape data and save it as a CSV</strong> file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Extracting Data from HTML</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="499" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-231.png" alt="" class="wp-image-1313581" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-231.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-231-300x201.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>In this section, we&#8217;ll discuss extracting data from HTML using Python. The focus will be on utilizing the BeautifulSoup library and locating elements by their tags and attributes. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Using BeautifulSoup</h3>



<p>BeautifulSoup is a popular Python library that simplifies web scraping tasks by making it easy to parse and navigate through HTML. To get started, import the library and request the page content you want to scrape, then create a BeautifulSoup object to parse the data:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests

url = "example_website"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
</pre>



<p>Now you have a BeautifulSoup object and can start extracting data from the HTML. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Locating Elements by Tags and Attributes</h3>



<p>BeautifulSoup provides various methods to locate elements by their tags and attributes. Some common methods include <code>find()</code>, <code>find_all()</code>, <code>select()</code>, and <code>select_one()</code>. </p>



<p>Let&#8217;s see these methods in action:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Find the first &lt;span> tag
span_tag = soup.find("span")

# Find all &lt;span> tags
all_span_tags = soup.find_all("span")

# Locate elements using CSS selectors
title = soup.select_one("title")

# Find all &lt;a> tags with the "href" attribute
links = soup.find_all("a", {"href": True})
</pre>



<p>These methods allow you to easily navigate and extract data from an HTML structure. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d0.png" alt="🧐" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>Once you have located the HTML elements containing the needed data, you can extract the text and attributes. </p>



<p>Here&#8217;s how:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Extract text from a tag
text = span_tag.text

# Extract an attribute value
url = links[0]["href"]
</pre>



<p>Finally, to save the extracted data into a CSV file, you can use Python&#8217;s built-in <code>csv</code> module. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f603.png" alt="😃" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import csv

# Writing extracted data to a CSV file
with open("output.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Index", "Title"])
    for index, link in enumerate(links, start=1):
        writer.writerow([index, link.text])
</pre>



<p>Following these steps, you can successfully extract data from HTML using Python and BeautifulSoup, and save it as a CSV file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f389.png" alt="🎉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/basketball-statistics-page-scraping-using-python-and-beautifulsoup/" data-type="post" data-id="1081082" target="_blank" rel="noreferrer noopener">Basketball Statistics – Page Scraping Using Python and BeautifulSoup</a></p>



<h2 class="wp-block-heading">Organizing Data</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="531" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-232.png" alt="" class="wp-image-1313582" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-232.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-232-300x214.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>This section explains how to <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-create-a-dictionary-from-two-lists/" data-type="post" data-id="316802" target="_blank">create a dictionary</a> to store the scraped data and how to write the organized data into a CSV file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Creating a Dictionary</h3>



<p>Begin by defining an empty dictionary that will store the extracted data elements. </p>



<p>In this case, the focus is on quotes, authors, and any associated tags. Each type of extracted element should have its own key, and the value should be a list that contains the individual instances of that element. </p>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">
data = {
    "quotes": [],
    "authors": [],
    "tags": []
}
</pre>



<p>As you scrape the data, append each item to its respective <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>. This approach makes the information easy to index and retrieve when needed. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4da.png" alt="📚" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
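<p>As a minimal sketch of this appending step, assuming a quotes page with hypothetical <code>quote</code>, <code>text</code>, <code>author</code>, and <code>tag</code> class names (the real class names depend on the site you scrape):</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of the page being scraped; the class names
# are illustrative, not taken from a real site
html = """
<div class="quote">
    <span class="text">To be or not to be.</span>
    <small class="author">William Shakespeare</small>
    <a class="tag">life</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

data = {"quotes": [], "authors": [], "tags": []}

# Append each scraped item to its respective list
for quote in soup.find_all("div", {"class": "quote"}):
    data["quotes"].append(quote.find("span", {"class": "text"}).text)
    data["authors"].append(quote.find("small", {"class": "author"}).text)
    data["tags"].append([a.text for a in quote.find_all("a", {"class": "tag"})])
```

<p>Because every list grows in lockstep, the first quote, first author, and first tag list all share index <code>0</code>, which keeps the data aligned for the dataframe conversion below.</p>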



<h3 class="wp-block-heading">Working with DataFrames and Pandas</h3>



<p>Once the data is stored in a dictionary, it&#8217;s time to <a href="https://blog.finxter.com/dictionary-of-lists-to-dataframe-python-conversion/" data-type="post" data-id="1296622" target="_blank" rel="noreferrer noopener">convert it into a dataframe</a>. Using the <a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">Pandas</a> library, it&#8217;s easy to transform the dictionary into a dataframe where the keys become the column names and the respective lists become the column values. </p>



<p>Simply use the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df = pd.DataFrame(data)</pre>



<h3 class="wp-block-heading">Exporting Data to a CSV File</h3>



<p>With the dataframe prepared, it&#8217;s time to write it to a CSV file. Thankfully, Pandas comes to the rescue once again. Using the dataframe&#8217;s built-in <code><a href="https://blog.finxter.com/convert-html-table-to-csv-in-python/" data-type="post" data-id="590862" target="_blank" rel="noreferrer noopener">.to_csv()</a></code> method, it&#8217;s possible to create a CSV file from the dataframe, like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">
df.to_csv('scraped_data.csv', index=False)
</pre>



<p>This command will generate a CSV file called <code>'scraped_data.csv'</code> containing the organized data with columns for quotes, authors, and tags. The <code>index=False</code> parameter ensures that the dataframe&#8217;s index isn&#8217;t added as an additional column. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4dd.png" alt="📝" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/read-a-csv-file-to-a-pandas-dataframe/" data-type="post" data-id="440655" target="_blank" rel="noreferrer noopener">17 Ways to Read a CSV File to a Pandas DataFrame</a></p>



<p>And there you have it—a neat, organized CSV file containing your scraped data!</p>



<h2 class="wp-block-heading">Handling Pagination</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="575" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-233.png" alt="" class="wp-image-1313583" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-233.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-233-300x232.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>This section will discuss how to handle pagination while scraping data from multiple URLs using Python to save the extracted content in a CSV format. It is essential to manage pagination effectively because most websites display their content across several pages.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Looping Through Web Pages</h3>



<p>Looping through web pages requires the developer to identify a pattern in the URLs, which can assist in iterating over them seamlessly. Typically, this pattern would include the page number as a variable, making it easy to adjust during the scraping process.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f501.png" alt="🔁" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>Once the pattern is identified, you can use a for loop to iterate over a range of page numbers. For each iteration, update the URL with the page number and then proceed with the scraping process. This method allows you to extract data from multiple pages systematically.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f5a5.png" alt="🖥" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>For instance, let&#8217;s consider that the base URL for every page is <code><em>"https://www.example.com/listing?page="</em></code>, where the page number is appended to the end. </p>



<p>Here is a Python example that demonstrates handling pagination when working with such URLs:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from bs4 import BeautifulSoup
import csv

base_url = "https://www.example.com/listing?page="

with open("scraped_data.csv", "w", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Data_Title", "Data_Content"])  # Header row

    for page_number in range(1, 6):  # Loop through page numbers 1 to 5
        url = base_url + str(page_number)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        
        # TODO: Add scraping logic here and write the content to the CSV file.

</pre>



<p>In this example, the script iterates through the first five pages of the website and writes the scraped content to a CSV file. Note that you will need to implement the actual scraping logic (e.g., extracting the desired content using Beautiful Soup) based on the website&#8217;s structure.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f310.png" alt="🌐" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
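<p>As a rough sketch, the scraping step inside the loop might look like the following, assuming each listing is a <code>&lt;div class="listing"></code> with hypothetical <code>title</code> and <code>content</code> children. The example runs against an inline HTML fragment and an in-memory buffer instead of a live page and a real file, so the class names and markup are illustrative only:</p>

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page structure; real tag and class names depend
# on the target website
page_html = """
<div class="listing">
    <h2 class="title">Item A</h2><p class="content">First entry</p>
</div>
<div class="listing">
    <h2 class="title">Item B</h2><p class="content">Second entry</p>
</div>
"""
soup = BeautifulSoup(page_html, "html.parser")

buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(["Data_Title", "Data_Content"])  # Header row

# The extraction step that would replace the TODO inside the page loop
for listing in soup.find_all("div", {"class": "listing"}):
    title = listing.find("h2", {"class": "title"}).text
    content = listing.find("p", {"class": "content"}).text
    csv_writer.writerow([title, content])

rows = buffer.getvalue().splitlines()
```

<p>In the real script, the same <code>for listing in ...</code> block would simply sit inside the page-number loop, writing to the already-open CSV file.</p>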



<p>Handling pagination with Python allows you to collect more comprehensive data sets<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4be.png" alt="💾" class="wp-smiley" style="height: 1em; max-height: 1em;" />, improving the overall success of your web scraping efforts. Make sure to respect the website&#8217;s <code>robots.txt</code> rules and rate limits to ensure responsible data collection.<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f916.png" alt="🤖" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Exporting Data to CSV</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="546" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-234.png" alt="" class="wp-image-1313584" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-234.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-234-300x220.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>You can export web scraping data to a CSV file in Python using the Python CSV module and the Pandas <code>to_csv</code> function. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f603.png" alt="😃" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Both approaches are widely used and efficiently handle large amounts of data.</p>



<h3 class="wp-block-heading">Python CSV Module</h3>



<p>The Python CSV module is a built-in library that offers functionality for reading from and writing to CSV files. It is simple and easy to use<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f44d.png" alt="👍" class="wp-smiley" style="height: 1em; max-height: 1em;" />. To begin, import the <code>csv</code> module.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import csv
</pre>



<p>To write the scraped data to a CSV file, <a href="https://blog.finxter.com/python-open-function/" data-type="post" data-id="24793" target="_blank" rel="noreferrer noopener">open</a> the file in write mode (<code>'w'</code>) with a specified file name, create a <a href="https://blog.finxter.com/write-python-dict-to-csv-columns-keys-first-values-second-column/" data-type="post" data-id="570680" target="_blank" rel="noreferrer noopener">CSV writer</a> object, and write the data using the <code>writerow()</code> or <code>writerows()</code> methods as required.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">scraped_data = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]  # placeholder rows from your scraper

with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["header1", "header2", "header3"])
    writer.writerows(scraped_data)
</pre>



<p>In this example, the header row is written first, followed by the rows of data obtained through web scraping. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f60a.png" alt="😊" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h3 class="wp-block-heading">Using Pandas to_csv()</h3>



<p>Another alternative is the powerful library Pandas, often used in data manipulation and analysis. To use it, start by importing the Pandas library.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
</pre>



<p>Pandas offers the <code>to_csv()</code> method, which can be applied to a DataFrame. If you have web-scraped data and stored it in a DataFrame, you can easily export it to a CSV file with the <code>to_csv()</code> method, as shown below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">dataframe.to_csv('data.csv', index=False)
</pre>



<p>In this example, the index parameter is set to <code>False</code> to exclude the DataFrame index from the CSV file. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4ca.png" alt="📊" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<p>The Pandas library also provides options for handling missing values, date formatting, and customizing separators and delimiters, making it a versatile choice for data export.</p>
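<p>For example, here is a quick sketch of a few of these <code>to_csv()</code> options applied to a small, made-up dataframe:</p>

```python
import pandas as pd

# Made-up data: one missing title and a datetime column
df = pd.DataFrame({
    "title": ["First", None],
    "scraped_at": pd.to_datetime(["2023-04-01", "2023-04-02"]),
})

# Customize the export: semicolon delimiter, a placeholder for
# missing values, and a fixed date format
csv_text = df.to_csv(index=False, sep=";", na_rep="N/A", date_format="%Y-%m-%d")
```

<p>Calling <code>to_csv()</code> without a file path, as above, returns the CSV content as a string, which is handy for inspecting the output before writing it to disk.</p>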



<h2 class="wp-block-heading">10 Minutes to Pandas in 5 Minutes </h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="743" height="495" src="https://blog.finxter.com/wp-content/uploads/2023/04/image-235.png" alt="" class="wp-image-1313585" srcset="https://blog.finxter.com/wp-content/uploads/2023/04/image-235.png 743w, https://blog.finxter.com/wp-content/uploads/2023/04/image-235-300x200.png 300w" sizes="auto, (max-width: 743px) 100vw, 743px" /></figure>
</div>


<p>If you&#8217;re just getting started with Pandas, I&#8217;d recommend you check out our free blog guide (it&#8217;s only 5 minutes!): <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f43c.png" alt="🐼" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">5 Minutes to Pandas &#8212; A Simple Helpful Guide to the Most Important Pandas Concepts (+ Cheat Sheet)</a></p>
<p>The post <a href="https://blog.finxter.com/python-web-scraping-from-url-to-csv-in-no-time/">Python Web Scraping: From URL to CSV in No Time</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?</title>
		<link>https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 21 Mar 2023 11:03:05 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1230960</guid>

					<description><![CDATA[<p>To access the first, second, or N-th child div element in BeautifulSoup, use the .contents or .find_all() methods on a parent div element. The .contents method returns a list of children, including tags and strings, while .find_all() returns a list of matching tags only. Simply select the desired index to obtain the child div element ... <a title="How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?" class="read-more" href="https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/" aria-label="Read more about How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/">How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-global-color-8-background-color has-background">To access the first, second, or N-th child div element in BeautifulSoup, use the <code>.contents</code> or <code>.find_all()</code> methods on a parent div element. The <code>.contents</code> method returns a list of children, including tags and strings, while <code>.find_all()</code> returns a list of matching tags only. Simply select the desired index to obtain the child div element you need.</p>



<p>In Beautiful Soup, you can navigate to the first, second, or third <code>div</code> within a parent <code>div</code> using the <code>.contents</code> or <code>.find_all()</code> methods. </p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="16-19,26-27" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup

html = """
&lt;div id="parent-div">
    &lt;div class="child-div">First child div&lt;/div>
    &lt;div class="child-div">Second child div&lt;/div>
    &lt;div class="child-div">Third child div&lt;/div>
&lt;/div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the parent div
parent_div = soup.find('div', {'id': 'parent-div'})

# Method 1: Using .contents
first_child_div = parent_div.contents[1]
second_child_div = parent_div.contents[3]
third_child_div = parent_div.contents[5]

print("Using .contents:")
print("First child div:", first_child_div.text)
print("Second child div:", second_child_div.text)
print("Third child div:", third_child_div.text)

# Method 2: Using .find_all()
all_child_divs = parent_div.find_all('div', {'class': 'child-div'})

print("\nUsing .find_all():")
print("First child div:", all_child_divs[0].text)
print("Second child div:", all_child_divs[1].text)
print("Third child div:", all_child_divs[2].text)
</pre>



<p>The output of this script is:</p>



<pre class="wp-block-preformatted"><code>Using .contents:
First child div: First child div
Second child div: Second child div
Third child div: Third child div

Using .find_all():
First child div: First child div
Second child div: Second child div
Third child div: Third child div</code>
</pre>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>:<br><br>The <code>.contents</code> solution returns a list of the parent element&#8217;s children, including tags and strings. Note that the indexing numbers are shifted using this solution because the whitespace between tags counts as string children, i.e., the first element is indexed using <code>.contents[1]</code>, the second with <code>.contents[3]</code>, and the <code>n</code>-th with <code>.contents[2*n-1]</code>.<br><br>The <code>.find_all()</code> solution returns a list of matching tags only. </p>



<p>You can use either method to navigate to the first, second, or third <code>div</code> within a parent <code>div</code>.</p>
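<p>If the shifted <code>.contents</code> indices feel error-prone, one option is to filter out the string children first, so the <code>n</code>-th child <code>div</code> sits at the natural index <code>n-1</code>. A minimal sketch, reusing the same parent/child HTML as above:</p>

```python
from bs4 import BeautifulSoup

html = """
<div id="parent-div">
    <div class="child-div">First child div</div>
    <div class="child-div">Second child div</div>
    <div class="child-div">Third child div</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
parent_div = soup.find("div", {"id": "parent-div"})

# Keep only child tags named "div", skipping the whitespace strings
# (NavigableString children have no tag name), so the n-th child
# div is simply child_divs[n-1]
child_divs = [c for c in parent_div.contents if getattr(c, "name", None) == "div"]
```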



<h2 class="wp-block-heading">Keep Learning</h2>



<p>If you want to learn BeautifulSoup from scratch, I&#8217;d recommend you check out our academy course:</p>



<figure class="wp-block-image size-full"><a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="994" height="876" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-233.png" alt="" class="wp-image-1230972" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-233.png 994w, https://blog.finxter.com/wp-content/uploads/2023/03/image-233-300x264.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-233-768x677.png 768w" sizes="auto, (max-width: 994px) 100vw, 994px" /></a></figure>
<p>The post <a href="https://blog.finxter.com/how-to-access-the-first-second-or-n-th-child-div-element-in-beautifulsoup/">How to Access the First, Second, or N-th Child Div Element in BeautifulSoup?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup</title>
		<link>https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/</link>
		
		<dc:creator><![CDATA[Stephen Schwaner]]></dc:creator>
		<pubDate>Thu, 09 Mar 2023 13:14:48 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1194257</guid>

					<description><![CDATA[<p>Project Motivation My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch. Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that ... <a title="I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup" class="read-more" href="https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/" aria-label="Read more about I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/">I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="I Built a Kids’ Movie Ratings Database Using Beautiful Soup" width="937" height="527" src="https://www.youtube.com/embed/fL5YrE2k8uo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Project Motivation</h2>



<p>My wife and I are pretty discerning about which movies we allow our two daughters (ages 4 and 5) to watch. </p>



<p>Recently, we were in conversation with their teachers at school about assembling a good list of age-appropriate movies. To simplify the process, I decided to build a database of movie ratings that is easily sortable/filterable by scraping information from relevant websites. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="976" height="637" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-113.png" alt="" class="wp-image-1194318" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-113.png 976w, https://blog.finxter.com/wp-content/uploads/2023/03/image-113-300x196.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-113-768x501.png 768w" sizes="auto, (max-width: 976px) 100vw, 976px" /></figure>
</div>


<p>There are a few websites that we use to determine whether a movie is age-appropriate, but one of our favorites is <a href="https://kids-in-mind.com/" target="_blank" rel="noreferrer noopener">Kids-In-Mind</a>, so I decided to start there. Kids-In-Mind provides a ranking from 0 (none) to 10 (extreme) for a movie’s sex, violence, and foul language content. I set out to pull all of these ratings and condense them into a single Excel sheet that I could sort and filter however I like.</p>



<h2 class="wp-block-heading">What You Will Learn</h2>



<p>This article is written for someone familiar with Python, but who is a beginner at web scraping. This <a href="https://web.stanford.edu/group/csp/cs21/htmlcheatsheet.pdf" target="_blank" rel="noreferrer noopener">HTML cheat sheet</a> may be a helpful resource for quickly looking up different HTML tags.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="981" height="649" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-114.png" alt="" class="wp-image-1194319" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-114.png 981w, https://blog.finxter.com/wp-content/uploads/2023/03/image-114-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-114-768x508.png 768w" sizes="auto, (max-width: 981px) 100vw, 981px" /></figure>
</div>


<p>In this article, you will learn how I:</p>



<ul class="wp-block-list">
<li>Came up with a plan for scraping data from Kids-In-Mind</li>



<li>Examined the HTML for the relevant web pages</li>



<li>Used <strong>BeautifulSoup</strong> to parse the HTML for movie rating information</li>



<li>Handled variations in how pages were organized</li>



<li>Used pandas to write the resulting data to a CSV file</li>
</ul>



<p>In the rest of the article, I will abbreviate <strong>BeautifulSoup</strong> as <strong>bs4.</strong></p>



<p>You can download the full script here <a rel="noreferrer noopener" href="https://github.com/finxter/WebScrapeKidsMovies" data-type="URL" data-id="https://github.com/finxter/WebScrapeKidsMovies" target="_blank">https://github.com/finxter/WebScrapeKidsMovies</a>. I also attach the full script to the end of this page, so keep reading! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<h2 class="wp-block-heading">Planning the Scraping Approach</h2>



<p>First things first: how should I get started? When I visited the Kids-In-Mind home page, I noticed that they have a link to an “A-Z Index.” Jackpot! I realized I could visit each “letter” page and either follow links or pull information to get the data I needed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="449" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-105-1024x449.png" alt="" class="wp-image-1194261" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-105-1024x449.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105-300x132.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105-768x337.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105-1536x674.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-105.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>I was pleasantly surprised again when I visited the “A” page. The title, MPAA rating, year, and content ratings were all contained right there on the page! I decided to pull the HTML from each “letter” page and then parse that HTML to scrape information for each movie. </p>



<p>Clicking on the links to the “A” and “B” pages took me to the following URLs:</p>



<ul class="wp-block-list">
<li><a href="https://kids-in-mind.com/a.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/a.htm</a></li>



<li><a href="https://kids-in-mind.com/b.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/b.htm</a></li>
</ul>



<p>As you can see, simply exchanging the “a” for the “b” allowed me to navigate to each “letter” page on the site. This is how I decided to iterate through pages to pull information for all the movies on the site.</p>
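<p>Generating all of these &#8220;letter&#8221; URLs in Python is a one-liner:</p>

```python
import string

base_url = "https://kids-in-mind.com/"

# One URL per "letter" page, a.htm through z.htm
letter_urls = [f"{base_url}{letter}.htm" for letter in string.ascii_lowercase]
```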



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="528" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-106-1024x528.png" alt="" class="wp-image-1194263" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-106-1024x528.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106-300x155.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106-768x396.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106-1536x792.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-106.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To proceed, I still needed to figure out how each page was structured. I right-clicked on the first movie (<em>Abandon</em>) and selected the “Inspect” option (I’m using Google Chrome). </p>



<p>You can see that:</p>



<ul class="wp-block-list">
<li>The list of movies is contained within a <code>&lt;div></code> tag with an attribute <code>class = "et_pb_text_inner"</code> <strong>(1)</strong>,</li>



<li>The link and movie titles are each contained within an <code>&lt;a></code> tag <strong>(2)</strong>, and</li>



<li>the year and ratings are contained within the text trailing each <code>&lt;a></code> tag <strong>(3)</strong>.</li>
</ul>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: <em>Since I’m new to HTML, I initially thought the text with rating information was associated with each <code>&lt;a></code> tag. Upon closer inspection using </em><strong><em>BeautifulSoup</em></strong><em>, I found out that the text was actually associated with the <code>&lt;div></code> tag. You’ll see that in the code, further down.</em></p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="513" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-107-1024x513.png" alt="" class="wp-image-1194267" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-107-1024x513.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107-300x150.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107-768x385.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107-1536x770.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-107.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>In addition to the number of ratings for each content category, I also wanted to pull more detailed information about the sex content. </p>



<p>Since my kids are so young, sometimes even movies with low sex ratings can be inappropriate for them. For example, the movie might be aimed at a 10-year-old even though it is rated G with a sex rating of 1.</p>



<p>To get this content, I needed to follow each movie link to that movie’s page. I clicked on the “<em>The Adventures of Rocky and Bullwinkle</em>” link and used the “Inspect” tool to check out the HTML defining the movie’s “Sex/Nudity” section. </p>



<p>You can see:</p>



<ul class="wp-block-list">
<li>There is a <code>&lt;span></code> tag <strong>(2)</strong> nested inside a <code>&lt;p></code> tag <strong>(1)</strong>,</li>



<li>the <code>&lt;span></code> tag contains the paragraph heading, “Sex/Nudity” <strong>(3),</strong></li>



<li>and the text <strong>(4)</strong> trails the <code>&lt;span></code> tag.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="510" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-108-1024x510.png" alt="" class="wp-image-1194268" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-108-1024x510.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108-300x149.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108-768x383.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108-1536x765.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-108.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Now that I had visited a few relevant pages from the site and inspected the underlying HTML, I was able to define a general approach:</p>



<p>Scrape movie titles and ratings:</p>



<ol class="wp-block-list">
<li>Loop through each “letter” page and pull the HTML</li>



<li>Use <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank">BeautifulSoup</a> to find all <code>&lt;div></code> tags with <code>class = "et_pb_text_inner"</code></li>



<li>Determine which <code>&lt;div></code> tag contains the list of movies</li>



<li>Get the text from the <code>&lt;div></code> tag and parse it for movie names and information</li>



<li>Loop through each nested <code>&lt;a></code> tag and get the URL leading to each movie page (the value of the <code>href</code> attribute)</li>
</ol>



<p>Scrape sexual content description:</p>



<ol class="wp-block-list">
<li>Follow the <code>href</code> attribute contained in each <code>&lt;a></code> tag (contains the link to that movie’s page)</li>



<li>Use BeautifulSoup to find all <code>&lt;p></code> tags</li>



<li>Loop through <code>&lt;p></code> tags until I find one that contains the text “SEX/NUDITY”</li>



<li>Extract the text</li>
</ol>
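<p>Steps 2&#8211;4 of this plan can be rehearsed on a toy HTML snippet before touching the real site (the snippet below is invented; only the &#8220;SEX/NUDITY&#8221; marker text comes from the site):</p>

```python
from bs4 import BeautifulSoup

# Invented stand-in for a movie page
html = """
<p>Plot summary goes here.</p>
<p>SEX/NUDITY 4 - A man and a woman kiss briefly.</p>
<p>VIOLENCE/GORE 4 - Several fight scenes.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Loop through all <p> tags until one contains the marker text
sex_content = ""
for p in soup.find_all("p"):
    if "SEX/NUDITY" in p.text:
        sex_content = p.text
        break

print(sex_content)  # SEX/NUDITY 4 - A man and a woman kiss briefly.
```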



<p>Organize data and save it to a file:</p>



<ol class="wp-block-list">
<li>Build a <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> containing keys for each piece of information (title, year, rating, etc.)</li>



<li>Convert the dictionary to a pandas data frame</li>



<li>Write the data frame to a CSV file</li>
</ol>
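<p>The dictionary-to-data-frame step can be sketched as follows (the sample rows and this subset of column names are for illustration only):</p>

```python
import io
import pandas as pd

# Invented sample data; the dictionary keys become the column titles
movie_dict = {"title": ["Abandon", "Abominable"],
              "year": [2002, 2019],
              "mpaa": ["PG-13", "PG"]}

df_movies = pd.DataFrame(movie_dict)

# The real script writes "Movies.csv"; an in-memory buffer works the same way
buf = io.StringIO()
df_movies.to_csv(buf)

print(df_movies.shape)  # (2, 3)
```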



<h2 class="wp-block-heading">Scraping the Movie Titles and Ratings</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="721" height="926" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-115.png" alt="" class="wp-image-1194322" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-115.png 721w, https://blog.finxter.com/wp-content/uploads/2023/03/image-115-234x300.png 234w" sizes="auto, (max-width: 721px) 100vw, 721px" /></figure>
</div>


<p>The <code>import</code> statements needed for the code shown in this section are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin</pre>



<p>I decided to call the main function <code>scrape_kim_ratings()</code>, and I gave it an input of all of the letter pages I wanted to scrape. </p>



<p>Next, I initialized the dictionary containing all the movie information, which would be converted to a pandas data frame. </p>



<p>The dictionary keys become the data frame column titles after conversion:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_kim_ratings(letters):
   movie_dict = {"title": [],
                 "year": [],
                 "mpaa": [],
                 "KIM sex": [],
                 "KIM violence": [],
                 "KIM language": [],
                 "KIM sex content": []}
</pre>



<p>Next, I defined a for loop to loop through each letter page and pull the HTML from each page using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-get-the-ultimate-guide/" data-type="post" data-id="37837" target="_blank">requests.get()</a></code> method. Once I had the HTML, I used BeautifulSoup to find all <code>&lt;div></code> tags with an attribute <code>class = "et_pb_text_inner"</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">   for letter in letters:
       # Get a response from each letter page
       url = f"https://kids-in-mind.com/{letter}.htm"
       res = requests.get(url)


       if res:
           # Get the HTML from that page
           soup = BeautifulSoup(res.text, "html.parser")
           # The list of movies is in a div tag with class = et_pb_text_inner
           div = soup.findAll("div", class_="et_pb_text_inner")
</pre>



<p>As it turns out, the letter pages contained multiple tags matching these criteria, so I had to figure out which tag contained the list of movies. </p>



<p>You’ll see that I looped through each of the <code>div</code> tags (<code>for entry in div:</code>), used the <strong>bs4</strong> <code>getText()</code> method to pull the entry&#8217;s text, and looked to see if the text contained <strong>“Movie Reviews by Title.”</strong> </p>



<p>The next tag contained the <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a> of movies – I had figured this out by inspecting the HTML of a few of the letter pages. In the code below, <code>idx</code> is the index of the tag containing the list of movies: </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # Find the list of movies. It comes after "Movie Reviews by Title"
           idx = 0
           for entry in div:
               text = entry.getText()
               if "Movie Reviews by Title" in text:
                   idx += 1
                   break
               idx += 1
</pre>



<p>Next, I used the bs4 <code>getText()</code> method to get a string of all the text from the <code>&lt;div></code> tag with the list of movies. The object stored in <code>div[idx]</code> is an instance of the <code>bs4.element.Tag</code> class, which means we can think of it as a <code>&lt;div></code> tag that can be parsed and manipulated with bs4 functions and methods. </p>



<p>You can use Python’s <code><a href="https://blog.finxter.com/python-type/" data-type="post" data-id="23967" target="_blank" rel="noreferrer noopener">type()</a></code> function to determine this. I used the <code>type()</code> function heavily while I was figuring out how the bs4 functions worked and what their outputs were.</p>
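<p>For example, a quick <code>type()</code> check on a throwaway HTML string shows the bs4 types involved:</p>

```python
from bs4 import BeautifulSoup
import bs4.element

# Throwaway HTML string, just to inspect the types bs4 returns
soup = BeautifulSoup('<div class="et_pb_text_inner"><a href="/a/abandon.htm">Abandon</a></div>',
                     "html.parser")
div = soup.findAll("div", class_="et_pb_text_inner")

print(type(soup))    # <class 'bs4.BeautifulSoup'>
print(type(div))     # <class 'bs4.element.ResultSet'>
print(type(div[0]))  # <class 'bs4.element.Tag'>
```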



<p>All the movies were separated by newline characters, so I used the <code>split()</code> method to get a list containing a different movie in each entry:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # All movies on the page, separated by \n 
           # (movie names with ratings are stored as text of the div tag)
           movies = div[idx].getText().split("\n")
</pre>



<p>To be honest, at first, I didn’t know that all the movies were stored as text within the <code>&lt;div></code> tag. I thought I was going to have to pull the text from each <code>&lt;a></code> tag within the <code>&lt;div></code> tag. </p>



<p>However, using the PyCharm debugger to play around with <code>div[idx]</code>, I discovered that pulling the text from the <code>&lt;div></code> tag provided me with the movie information.</p>



<p>Next, I needed to get the links that would take me to each movie page. I used the <code>findAll()</code> method to get all <code>&lt;a></code> tags and then used the <code>urljoin()</code> function to join the URL of the current “letter” web page (like <a href="https://kids-in-mind.com/a.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/a.htm</a>) with the relative link to the movie page (like /a/abandon.htm). </p>



<p>An example result is <a href="https://kids-in-mind.com/a/abandon.htm" target="_blank" rel="noreferrer noopener">https://kids-in-mind.com/a/abandon.htm</a>. I used <a href="https://blog.finxter.com/list-comprehension/" data-type="post" data-id="1171" target="_blank" rel="noreferrer noopener">list comprehension</a> to put them all in a list, <code>links</code>:</p>
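<p>You can verify the <code>urljoin()</code> behavior in isolation:</p>

```python
from urllib.parse import urljoin

# A site-root-relative href replaces the path of the base URL
print(urljoin("https://kids-in-mind.com/a.htm", "/a/abandon.htm"))
# https://kids-in-mind.com/a/abandon.htm

# An already-absolute href is returned unchanged
print(urljoin("https://kids-in-mind.com/a.htm", "https://example.com/x.htm"))
# https://example.com/x.htm
```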



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # href links to each movie page are stored in a tags
           a = div[idx].findAll("a")
           links = [urljoin(url, x["href"]) for x in a]
</pre>



<p>Now I had all of the movie rating information for a given letter page and all the links to the movie pages. The next steps were to:</p>



<ol class="wp-block-list">
<li>Parse each string in <code>movies</code> for each rating and other pieces of information</li>



<li>Follow each link in <code>links</code> and parse the sexual content</li>
</ol>



<p>To make it easier to loop through both lists at once, I used the <code>zip()</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           # zip these up to make iteration easier in the for loop
           movies_and_links = list(zip(movies, links))
</pre>
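<p>One caveat with <code>zip()</code>: it stops at the end of the shorter input, so if the parsed movie list and the link list ever get out of sync, the extra entries are silently dropped (the entries below are invented):</p>

```python
movies = ["Abandon [2002] [PG-13] - 4.4.4", "Abominable [2019] [PG] - 1.4.1"]
links = ["https://kids-in-mind.com/a/abandon.htm"]  # one link missing

# zip() truncates to the shorter list
movies_and_links = list(zip(movies, links))
print(len(movies_and_links))  # 1 -- the second movie is silently dropped
```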



<p>Next, I looped through each <code>movie</code> and each <code>link</code>. First, I parsed the string in <code>movie</code> for the year, MPAA rating, Kids In Mind ratings, and the movie title using a function that I defined called <code>parse_movie()</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">           for movie, link in movies_and_links:
               # get the information available in the list on each letter page
               year, mpaa, ratings, title = parse_movie(movie)
               print(f"Title is {title}")
</pre>



<p>This function took a bit of trial and error to write.</p>



<p>At first, I thought all of the strings were formatted like <code>"Abandon [2002] [PG-13] – 4.4.4"</code>. </p>



<p>However, after running the code once, I saw that some of the strings were formatted like this, <code>"Abandon [<em>Foreign Name</em>] [2002] [PG-13] – 4.4.4"</code>, with an additional set of brackets containing the film&#8217;s name in a different language. </p>



<p>I had to add the code block at the very beginning of the function to skip over this set of brackets.</p>



<p>You can see that the two main functions I used were the string methods <code><a href="https://blog.finxter.com/python-string-find/" data-type="post" data-id="26011" target="_blank" rel="noreferrer noopener">find()</a></code> (to find the brackets) and <code><a href="https://blog.finxter.com/python-string-split/" data-type="post" data-id="26097" target="_blank" rel="noreferrer noopener">split()</a></code> (to isolate the Kids In Mind ratings). </p>



<p>The last tricky bit that gave me trouble was that sometimes the Kids In Mind ratings were separated by an <a href="https://www.thepunctuationguide.com/en-dash.html" target="_blank" rel="noreferrer noopener">en dash</a> and other times by a plain hyphen: </p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def parse_movie(movie):
   # some entries had a foreign name in brackets
   if movie.count("]") > 2:
       start_idx = movie.find("]") + 1
   else:
       start_idx = 0

   # year is usually in the first set of brackets
   year_idx1 = movie.find("[", start_idx)
   year_idx2 = movie.find("]", start_idx)

   # mpaa rating was next
   mpaa_idx1 = movie.find("[", year_idx1 + 1)
   mpaa_idx2 = movie.find("]", year_idx2 + 1)

   year = int(movie[year_idx1 + 1:year_idx2].strip())
   mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]

   # the ratings came after a dash and were formatted like #.#.#
   ratings_split = movie.split("–")
   # sometimes they used an en dash, sometimes a plain hyphen
   if len(ratings_split) == 1:
       ratings_split = movie.split("-")

   ratings = [int(x) for x in ratings_split[-1].split(".")]

   title = movie[0:year_idx1]

   return year, mpaa, ratings, title
</pre>
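<p>To see the <code>find()</code> and <code>split()</code> mechanics in isolation, here they are applied to a sample entry in the site&#8217;s format (using a plain hyphen for simplicity):</p>

```python
movie = "Abandon [2002] [PG-13] - 4.4.4"

# The year sits in the first set of brackets
year = int(movie[movie.find("[") + 1 : movie.find("]")])

# The ratings trail the dash; split("-")[-1] grabs everything after the last hyphen
ratings = [int(x) for x in movie.split("-")[-1].split(".")]

print(year)     # 2002
print(ratings)  # [4, 4, 4]
```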



<h2 class="wp-block-heading">Scrape sexual content description</h2>



<p>The additional import statements needed for the code in this section are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import bs4.element
import random
</pre>



<p>After parsing <code>movie</code>, it was time to follow the link to the movie’s page and pull a more detailed description of sexual content using the function <code>scrape_kim_sexcontent()</code>. </p>



<p>Since this was going to require making many “<code>get</code>” requests to the Kids In Mind website, I also added a variable time delay in between each request using the <code><a href="https://blog.finxter.com/time-delay-in-python/" data-type="post" data-id="138154" target="_blank" rel="noreferrer noopener">time.sleep()</a></code> function. I did this for two reasons:</p>



<ol class="wp-block-list">
<li>It’s good practice to add some sort of delay between requests so that you do not overload the website’s server.</li>



<li>Adding a bit of random variation to the time delays can trick the web server into thinking your web scraping script is a human, making it less likely to reject subsequent requests.</li>
</ol>



<p>Code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># follow each movie link to get the sex content description
               start = time.time()
               sex_content = scrape_kim_sexcontent(link)
               delay = time.time() - start

               wait_time = random.uniform(.5, 2) * delay
               print(f'Just finished {title}')
               print(f'wait time is {wait_time}')
               time.sleep(wait_time)
</pre>



<p>Scraping the detailed descriptions proved a bit trickier than getting the <strong><em>Kids In Mind</em></strong> ratings. As I mentioned above, I planned to use the bs4 object method <code>findAll()</code> to get all of the <code>&lt;p></code> tags and find the one that contained sexual content.</p>



<p>Below is the first iteration of my <code>scrape_kim_sexcontent()</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_kim_sexcontent(url):
   # Request html from page and find all p tags
   res = requests.get(url)
   soup = BeautifulSoup(res.text, 'html.parser')
   res.close()
   p_set = soup.findAll("p")


   for entry in p_set:
       if 'SEX/NUDITY' in entry.text:
           sex_content = entry.text
           break

   return sex_content
</pre>



<p>However, I quickly realized that some of the movie pages were organized differently. The screenshot below shows a resulting CSV file. You can see that the script pulled a paragraph from the right side of the web page instead of the sexual content paragraph.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="876" height="311" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-109.png" alt="" class="wp-image-1194282" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-109.png 876w, https://blog.finxter.com/wp-content/uploads/2023/03/image-109-300x107.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-109-768x273.png 768w" sizes="auto, (max-width: 876px) 100vw, 876px" /></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="511" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-110-1024x511.png" alt="" class="wp-image-1194283" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-110-1024x511.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110-300x150.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110-768x384.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110-1536x767.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-110.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>It turns out that some of the movie pages, like the one for <em>Abominable</em>, had the title and text “SEX/NUDITY” in an <code>&lt;h2></code> tag preceding the <code>&lt;p></code> tag that contained the detailed description.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="419" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-111-1024x419.png" alt="" class="wp-image-1194284" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-111-1024x419.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111-300x123.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111-768x314.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111-1536x629.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-111.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>To handle this variation, I added some code. The final version of <code>scrape_kim_sexcontent()</code> is below. First, I looked for all of the <code>&lt;h2></code> tags. Then I looped through them until I found one with an id attribute equal to “sex”. I used the <code>bs4.element.Tag</code> attribute, <code>attrs</code>, to access each tag’s attributes as a dictionary.</p>



<p>If you take another look at the <em>Abominable</em> page HTML, you can see that the <code>&lt;p></code> tag containing the sexual content details is at the same level as the preceding <code>&lt;h2></code> tag rather than being nested within it.</p>



<p>This means that the <code>&lt;p></code> tag is a <em>sibling</em> of the <code>&lt;h2></code> tag, not its <em>child</em>. Thus, I was able to access it using the <code>bs4.element.Tag</code> attribute <code>next_siblings</code>, which returns an iterator over the siblings that follow the <code>&lt;h2></code> tag. </p>



<p>Finally, I used the <code>bs4.element.Tag</code> attribute <code>text</code> to get the paragraph I wanted:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_kim_sexcontent(url):
   # Request html from page and find all h2 tags
   res = requests.get(url)
   soup = BeautifulSoup(res.text, 'html.parser')
   res.close()
   h2_set = soup.findAll("h2")

   # Initialize
   sex_content = ""

   # Check the &lt;h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
   sibling_iter = []
   for entry in h2_set:
       if "id" in entry.attrs:
           if entry["id"] == "sex":
               sibling_iter = entry.next_siblings

               # Grab the next paragraph
                for sibling in sibling_iter:
                    if type(sibling) == bs4.element.Tag:
                        # stop at the first tag sibling: the description paragraph
                        sex_content = sibling.text
                        break

   # Sometimes header &lt;h2> tags aren't used to make the paragraph headers
   # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
   if sex_content == "":
       p_set = soup.findAll("p")

       for entry in p_set:
           if 'SEX/NUDITY' in entry.text:
               sex_content = entry.text
               break

   return sex_content
</pre>



<h2 class="wp-block-heading">Organize Data and Save it to a File</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="974" height="650" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-116.png" alt="" class="wp-image-1194326" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-116.png 974w, https://blog.finxter.com/wp-content/uploads/2023/03/image-116-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-116-768x513.png 768w" sizes="auto, (max-width: 974px) 100vw, 974px" /></figure>
</div>


<p>The additional import statements needed for the code in this section are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import string
</pre>



<p>Finally, it was time to organize the scraped data and save it to a CSV file. </p>



<p>I decided to use the <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">pandas library</a> since its <code>to_csv</code> data frame method makes it super easy to save data to a CSV file. </p>



<p>First, after parsing the information for each movie, I saved each piece of data in a dictionary. After each “letter” page was completed, I converted the growing dictionary to a pandas data frame using the <code>pd.DataFrame()</code> constructor and then saved the resulting data frame to a CSV file. </p>



<p>I decided to write to the CSV file after each “letter” page was completed to make sure that I would have data saved if the web scraping script was interrupted for some reason: </p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">               # Build dictionary for conversion to data frame
               movie_dict["title"].append(title)
               movie_dict["year"].append(year)
               movie_dict["mpaa"].append(mpaa)
               movie_dict["KIM sex"].append(ratings[0])
               movie_dict["KIM violence"].append(ratings[1])
               movie_dict["KIM language"].append(ratings[2])
               movie_dict["KIM sex content"].append(sex_content)

           res.close()

           # Write to the CSV after every letter
           print("\n")
           print("Writing to Movies.csv")
           df_movies = pd.DataFrame(movie_dict)
           df_movies.to_csv("Movies.csv")

           print(f"Done with {letter}. Waiting {wait_time} seconds")
           time.sleep(wait_time)

       else:
           print(f"Error: {res}")

   return df_movies
</pre>



<p>Lastly, I called the main function <code>scrape_kim_ratings()</code> and provided a list of all the <a href="https://blog.finxter.com/how-to-lowercase-a-string-in-python/" data-type="post" data-id="30102" target="_blank" rel="noreferrer noopener">lowercase</a> ASCII letters:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_movies = scrape_kim_ratings(string.ascii_lowercase)</pre>



<h2 class="wp-block-heading">Conclusion</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="969" height="643" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-117.png" alt="" class="wp-image-1194328" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-117.png 969w, https://blog.finxter.com/wp-content/uploads/2023/03/image-117-300x199.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-117-768x510.png 768w" sizes="auto, (max-width: 969px) 100vw, 969px" /></figure>



<p>So, there you have it! Here is a link to the GitHub page with the full script: <a rel="noreferrer noopener" href="https://github.com/finxter/WebScrapeKidsMovies" data-type="URL" data-id="https://github.com/finxter/WebScrapeKidsMovies" target="_blank">https://github.com/finxter/WebScrapeKidsMovies</a>. I&#8217;ll also attach it at the end of this article.</p>



<p>In the future, I think I will add functions to the script that pull information from other websites and add it to the current database. I might also add a function that checks the websites for any new movies/ratings and adds them to the current database.</p>



<p>I hope this will inspire you to write your own web scraping script!</p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/basketball-statistics-page-scraping-using-python-and-beautifulsoup/" data-type="URL" data-id="https://blog.finxter.com/basketball-statistics-page-scraping-using-python-and-beautifulsoup/" target="_blank" rel="noreferrer noopener">Basketball Statistics – Page Scraping Using Python and BeautifulSoup</a></p>



<h2 class="wp-block-heading">The Script</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import requests
from bs4 import BeautifulSoup
import bs4.element
import string
import time
from urllib.parse import urljoin
import random


def scrape_kim_sexcontent(url):
    # Request html from page and find all h2 tags
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    res.close()
    h2_set = soup.findAll("h2")

    # Initialize
    sex_content = ""

    # Check the &lt;h2> tags (headers). If you find id="sex", grab the next paragraph (p tag)
    sibling_iter = []
    for entry in h2_set:
        if "id" in entry.attrs:
            if entry["id"] == "sex":
                sibling_iter = entry.next_siblings

                # Grab the next paragraph
                for sibling in sibling_iter:
                    if type(sibling) == bs4.element.Tag:
                        # stop at the first tag sibling: the description paragraph
                        sex_content = sibling.text
                        break

    # Sometimes header &lt;h2> tags aren't used to make the paragraph headers
    # If you haven't found sex content yet, search all the p tags for "SEX/NUDITY"
    if sex_content == "":
        p_set = soup.findAll("p")

        for entry in p_set:
            if 'SEX/NUDITY' in entry.text:
                sex_content = entry.text
                break

    return sex_content


def parse_movie(movie):
    # some entries had a foreign name in brackets
    if movie.count("]") > 2:
        start_idx = movie.find("]") + 1
    else:
        start_idx = 0

    # year is usually in the first set of brackets
    year_idx1 = movie.find("[", start_idx)
    year_idx2 = movie.find("]", start_idx)

    # mpaa rating was next
    mpaa_idx1 = movie.find("[", year_idx1 + 1)
    mpaa_idx2 = movie.find("]", year_idx2 + 1)

    year = int(movie[year_idx1 + 1:year_idx2].strip())
    mpaa = movie[mpaa_idx1 + 1:mpaa_idx2]

    # the ratings came after a dash and were formatted like #.#.#
    ratings_split = movie.split("–")
    # sometimes they used an en dash, sometimes a plain hyphen
    if len(ratings_split) == 1:
        ratings_split = movie.split("-")

    ratings = [int(x) for x in ratings_split[-1].split(".")]

    title = movie[0:year_idx1]

    return year, mpaa, ratings, title


def scrape_kim_ratings(letters):
    movie_dict = {"title": [],
                  "year": [],
                  "mpaa": [],
                  "KIM sex": [],
                  "KIM violence": [],
                  "KIM language": [],
                  "KIM sex content": []}

    for letter in letters:
        # Get a response from each letter page
        url = f"https://kids-in-mind.com/{letter}.htm"
        res = requests.get(url)

        if res:
            # Get the HTML from that page
            soup = BeautifulSoup(res.text, "html.parser")
            # The list of movies is in a div tag with class = et_pb_text_inner
            div = soup.findAll("div", class_="et_pb_text_inner")

            # Find the list of movies. It comes after "Movie Reviews by Title"
            idx = 0
            for entry in div:
                text = entry.getText()
                if "Movie Reviews by Title" in text:
                    idx += 1
                    break
                idx += 1

            # All movies on the page, separated by \n (movie names with ratings are stored as text of the div tag)
            movies = div[idx].getText().split("\n")

            # href links to each movie page are stored in a tags
            a = div[idx].findAll("a")
            links = [urljoin(url, x["href"]) for x in a]

            # zip these up to make iteration easier in the for loop
            movies_and_links = list(zip(movies, links))

            for movie, link in movies_and_links:
                # get the information available in the list on each letter page
                year, mpaa, ratings, title = parse_movie(movie)
                print(f"Title is {title}")

                # follow each movie link to get the sex content description
                start = time.time()
                sex_content = scrape_kim_sexcontent(link)
                delay = time.time() - start

                wait_time = random.uniform(.5, 2) * delay
                print(f'Just finished {title}')
                print(f'wait time is {wait_time}')
                time.sleep(wait_time)

                # Build dictionary for conversion to data frame
                movie_dict["title"].append(title)
                movie_dict["year"].append(year)
                movie_dict["mpaa"].append(mpaa)
                movie_dict["KIM sex"].append(ratings[0])
                movie_dict["KIM violence"].append(ratings[1])
                movie_dict["KIM language"].append(ratings[2])
                movie_dict["KIM sex content"].append(sex_content)

            res.close()

            # Write to the CSV after every letter
            print("\n")
            print("Writing to Movies.csv")
            df_movies = pd.DataFrame(movie_dict)
            df_movies.to_csv("Movies.csv")

            print(f"Done with {letter}. Waiting {wait_time} seconds")
            time.sleep(wait_time)

        else:
            print(f"Error: {res}")

    return df_movies


df_movies = scrape_kim_ratings(string.ascii_lowercase)</pre>
<p>The post <a href="https://blog.finxter.com/i-built-a-kids-movie-ratings-database-using-beautiful-soup/">I Built a Kids&#8217; Movie Ratings Database Using Beautiful Soup</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python &#8211; How to Convert KML to CSV?</title>
		<link>https://blog.finxter.com/python-how-to-convert-kml-to-csv/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Thu, 18 Aug 2022 15:19:59 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[Input/Output]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[XML]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=587850</guid>

					<description><![CDATA[<p>What is KML? ℹ️ Definition: The Keyhole Markup Language (KML) is a file format for displaying geographic data in Google Earth or other so-called &#8220;Earth Browsers&#8221;. Similarly to XML, KML uses a tag-based structure with nested elements and attributes. How to Convert KML to CSV in Python? You can convert a .kml to a .csv ... <a title="Python &#8211; How to Convert KML to CSV?" class="read-more" href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/" aria-label="Read more about Python &#8211; How to Convert KML to CSV?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/">Python &#8211; How to Convert KML to CSV?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">What is KML?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="352" height="263" src="https://blog.finxter.com/wp-content/uploads/2022/08/image-45.png" alt="" class="wp-image-587952" srcset="https://blog.finxter.com/wp-content/uploads/2022/08/image-45.png 352w, https://blog.finxter.com/wp-content/uploads/2022/08/image-45-300x224.png 300w" sizes="auto, (max-width: 352px) 100vw, 352px" /></figure>
</div>


<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2139.png" alt="ℹ" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Definition</strong>: The <a href="https://developers.google.com/kml/documentation/kml_tut" data-type="URL" data-id="https://developers.google.com/kml/documentation/kml_tut" target="_blank" rel="noreferrer noopener">Keyhole Markup Language</a> (KML) is a file format for displaying geographic data in Google Earth or other so-called &#8220;Earth Browsers&#8221;. Similarly to XML, KML uses a tag-based structure with nested elements and attributes. </p>



<h2 class="wp-block-heading">How to Convert KML to CSV in Python?</h2>



<p class="has-global-color-8-background-color has-background">You can convert a <code>.kml</code> to a <code>.csv</code> file in Python by using the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/" data-type="post" data-id="474965" target="_blank">BeautifulSoup</a> and the <code>csv</code> libraries. You use the former to read the XML-structured KML file and the latter to write the CSV file row by row. </p>



<p>Here&#8217;s a code example, inspired by and modified from <a href="https://gist.github.com/mciantyre/32ff2c2d5cd9515c1ee7" data-type="URL" data-id="https://gist.github.com/mciantyre/32ff2c2d5cd9515c1ee7" target="_blank" rel="noreferrer noopener">this</a> GitHub gist. Copy and paste it into the directory where your KML file resides and change the input and output filenames at the top to convert your own KML file to a CSV in Python:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import csv


infile = 'my_file.kml'
outfile = 'my_file.csv'


with open(infile, 'r') as f:
    s = BeautifulSoup(f, 'xml')
    
    # Python 3: open in text mode with newline='' so the csv module controls line endings
    with open(outfile, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)

        for coords in s.find_all('coordinates'):
            
            # Take coordinate string from KML and break it up into [Lat,Lon,Lat,Lon...] to get CSV row
            space_splits = coords.string.split(" ")
            row = []
            
            for split in space_splits[1:]:
                # Note: the string inside &lt;coordinates> starts with a space, so the
                # first element of space_splits is empty and we skip it with [1:]
                comma_split = split.split(',')

                # latitude
                row.append(comma_split[1])
                
                # longitude
                row.append(comma_split[0])
            
            writer.writerow(row)
</pre>
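<p>As a side note, the <code>[1:]</code> slice in the loop above is only needed because splitting on a single space character produces an empty first element. Calling <code>str.split()</code> with no arguments splits on any whitespace run and drops empty strings automatically. Here&#8217;s a minimal, self-contained sketch of just the coordinate-parsing step (the coordinate string and the output filename <code>coords.csv</code> are made up for illustration):</p>

```python
import csv

# Hypothetical coordinate string as it appears inside a KML coordinates tag:
# a leading space, then whitespace-separated "lon,lat,alt" triples
coord_string = ' -112.0814237830345,36.10677870477137,0 -112.0870267752693,36.0905099328766,0 '

row = []
for token in coord_string.split():    # split() without args drops empty strings
    lon, lat = token.split(',')[:2]   # ignore the altitude component
    row.extend([lat, lon])            # latitude first, matching the script above

# write the flattened lat/lon values as one CSV row
with open('coords.csv', 'w', newline='') as f:
    csv.writer(f).writerow(row)

print(row)
# ['36.10677870477137', '-112.0814237830345', '36.0905099328766', '-112.0870267752693']
```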



<h2 class="wp-block-heading">Example Conversion</h2>



<p>We use the following <a href="https://developers.google.com/kml/documentation/kml_tut" data-type="URL" data-id="https://developers.google.com/kml/documentation/kml_tut" target="_blank" rel="noreferrer noopener">sample</a> KML file as <code>'my_file.kml'</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;kml xmlns="http://www.opengis.net/kml/2.2">
  &lt;Document>
    &lt;name>KML Samples&lt;/name>
    &lt;open>1&lt;/open>
    &lt;description>Unleash your creativity with the help of these examples!&lt;/description>
    &lt;Style id="downArrowIcon">
      &lt;IconStyle>
        &lt;Icon>
          &lt;href>http://maps.google.com/mapfiles/kml/pal4/icon28.png&lt;/href>
        &lt;/Icon>
      &lt;/IconStyle>
    &lt;/Style>
    &lt;Style id="globeIcon">
      &lt;IconStyle>
        &lt;Icon>
          &lt;href>http://maps.google.com/mapfiles/kml/pal3/icon19.png&lt;/href>
        &lt;/Icon>
      &lt;/IconStyle>
      &lt;LineStyle>
        &lt;width>2&lt;/width>
      &lt;/LineStyle>
    &lt;/Style>
    &lt;Style id="transPurpleLineGreenPoly">
      &lt;LineStyle>
        &lt;color>7fff00ff&lt;/color>
        &lt;width>4&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7f00ff00&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="yellowLineGreenPoly">
      &lt;LineStyle>
        &lt;color>7f00ffff&lt;/color>
        &lt;width>4&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7f00ff00&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="thickBlackLine">
      &lt;LineStyle>
        &lt;color>87000000&lt;/color>
        &lt;width>10&lt;/width>
      &lt;/LineStyle>
    &lt;/Style>
    &lt;Style id="redLineBluePoly">
      &lt;LineStyle>
        &lt;color>ff0000ff&lt;/color>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>ffff0000&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="blueLineRedPoly">
      &lt;LineStyle>
        &lt;color>ffff0000&lt;/color>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>ff0000ff&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transRedPoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7d0000ff&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transBluePoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7dff0000&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transGreenPoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7d00ff00&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="transYellowPoly">
      &lt;LineStyle>
        &lt;width>1.5&lt;/width>
      &lt;/LineStyle>
      &lt;PolyStyle>
        &lt;color>7d00ffff&lt;/color>
      &lt;/PolyStyle>
    &lt;/Style>
    &lt;Style id="noDrivingDirections">
      &lt;BalloonStyle>
        &lt;text>&lt;![CDATA[
          &lt;b>$[name]&lt;/b>
          &lt;br />&lt;br />
          $[description]
        ]]&gt;&lt;/text>
      &lt;/BalloonStyle>
    &lt;/Style>
    &lt;Folder>
      &lt;name>Placemarks&lt;/name>
      &lt;description>These are just some of the different kinds of placemarks with
        which you can mark your favorite places&lt;/description>
      &lt;LookAt>
        &lt;longitude>-122.0839597145766&lt;/longitude>
        &lt;latitude>37.42222904525232&lt;/latitude>
        &lt;altitude>0&lt;/altitude>
        &lt;heading>-148.4122922628044&lt;/heading>
        &lt;tilt>40.5575073395506&lt;/tilt>
        &lt;range>500.6566641072245&lt;/range>
      &lt;/LookAt>
      &lt;Placemark>
        &lt;name>Simple placemark&lt;/name>
        &lt;description>Attached to the ground. Intelligently places itself at the
          height of the underlying terrain.&lt;/description>
        &lt;Point>
          &lt;coordinates>-122.0822035425683,37.42228990140251,0&lt;/coordinates>
        &lt;/Point>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Floating placemark&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Floats a defined distance above the ground.&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.0839597145766&lt;/longitude>
          &lt;latitude>37.42222904525232&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-148.4122922628044&lt;/heading>
          &lt;tilt>40.5575073395506&lt;/tilt>
          &lt;range>500.6566641072245&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#downArrowIcon&lt;/styleUrl>
        &lt;Point>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates>-122.084075,37.4220033612141,50&lt;/coordinates>
        &lt;/Point>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Extruded placemark&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Tethered to the ground by a customizable
          &amp;quot;tail&amp;quot;&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.0845787421525&lt;/longitude>
          &lt;latitude>37.42215078737763&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-148.4126684946234&lt;/heading>
          &lt;tilt>40.55750733918048&lt;/tilt>
          &lt;range>365.2646606980322&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#globeIcon&lt;/styleUrl>
        &lt;Point>
          &lt;extrude>1&lt;/extrude>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates>-122.0857667006183,37.42156927867553,50&lt;/coordinates>
        &lt;/Point>
      &lt;/Placemark>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Styles and Markup&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>With KML it is easy to create rich, descriptive markup to
        annotate and enrich your placemarks&lt;/description>
      &lt;LookAt>
        &lt;longitude>-122.0845787422371&lt;/longitude>
        &lt;latitude>37.42215078726837&lt;/latitude>
        &lt;altitude>0&lt;/altitude>
        &lt;heading>-148.4126777488172&lt;/heading>
        &lt;tilt>40.55750733930874&lt;/tilt>
        &lt;range>365.2646826292919&lt;/range>
      &lt;/LookAt>
      &lt;styleUrl>#noDrivingDirections&lt;/styleUrl>
      &lt;Document>
        &lt;name>Highlighted Icon&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Place your mouse over the icon to see it display the new
          icon&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.0856552124024&lt;/longitude>
          &lt;latitude>37.4224281311035&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>0&lt;/heading>
          &lt;tilt>0&lt;/tilt>
          &lt;range>265.8520424250024&lt;/range>
        &lt;/LookAt>
        &lt;Style id="highlightPlacemark">
          &lt;IconStyle>
            &lt;Icon>
              &lt;href>http://maps.google.com/mapfiles/kml/paddle/red-stars.png&lt;/href>
            &lt;/Icon>
          &lt;/IconStyle>
        &lt;/Style>
        &lt;Style id="normalPlacemark">
          &lt;IconStyle>
            &lt;Icon>
              &lt;href>http://maps.google.com/mapfiles/kml/paddle/wht-blank.png&lt;/href>
            &lt;/Icon>
          &lt;/IconStyle>
        &lt;/Style>
        &lt;StyleMap id="exampleStyleMap">
          &lt;Pair>
            &lt;key>normal&lt;/key>
            &lt;styleUrl>#normalPlacemark&lt;/styleUrl>
          &lt;/Pair>
          &lt;Pair>
            &lt;key>highlight&lt;/key>
            &lt;styleUrl>#highlightPlacemark&lt;/styleUrl>
          &lt;/Pair>
        &lt;/StyleMap>
        &lt;Placemark>
          &lt;name>Roll over this icon&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#exampleStyleMap&lt;/styleUrl>
          &lt;Point>
            &lt;coordinates>-122.0856545755255,37.42243077405461,0&lt;/coordinates>
          &lt;/Point>
        &lt;/Placemark>
      &lt;/Document>
      &lt;Placemark>
        &lt;name>Descriptive HTML&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>&lt;![CDATA[Click on the blue link!&lt;br>&lt;br>
Placemark descriptions can be enriched by using many standard HTML tags.&lt;br>
For example:
&lt;hr>
Styles:&lt;br>
&lt;i>Italics&lt;/i>, 
&lt;b>Bold&lt;/b>, 
&lt;u>Underlined&lt;/u>, 
&lt;s>Strike Out&lt;/s>, 
subscript&lt;sub>subscript&lt;/sub>, 
superscript&lt;sup>superscript&lt;/sup>, 
&lt;big>Big&lt;/big>, 
&lt;small>Small&lt;/small>, 
&lt;tt>Typewriter&lt;/tt>, 
&lt;em>Emphasized&lt;/em>, 
&lt;strong>Strong&lt;/strong>, 
&lt;code>Code&lt;/code>
&lt;hr>
Fonts:&lt;br> 
&lt;font color="red">red by name&lt;/font>, 
&lt;font color="#408010">leaf green by hexadecimal RGB&lt;/font>
&lt;br>
&lt;font size=1>size 1&lt;/font>, 
&lt;font size=2>size 2&lt;/font>, 
&lt;font size=3>size 3&lt;/font>, 
&lt;font size=4>size 4&lt;/font>, 
&lt;font size=5>size 5&lt;/font>, 
&lt;font size=6>size 6&lt;/font>, 
&lt;font size=7>size 7&lt;/font>
&lt;br>
&lt;font face=times>Times&lt;/font>, 
&lt;font face=verdana>Verdana&lt;/font>, 
&lt;font face=arial>Arial&lt;/font>&lt;br>
&lt;hr>
Links: 
&lt;br>
&lt;a href="http://earth.google.com/">Google Earth!&lt;/a>
&lt;br>
 or:  Check out our website at www.google.com
&lt;hr>
Alignment:&lt;br>
&lt;p align=left>left&lt;/p>
&lt;p align=center>center&lt;/p>
&lt;p align=right>right&lt;/p>
&lt;hr>
Ordered Lists:&lt;br>
&lt;ol>&lt;li>First&lt;/li>&lt;li>Second&lt;/li>&lt;li>Third&lt;/li>&lt;/ol>
&lt;ol type="a">&lt;li>First&lt;/li>&lt;li>Second&lt;/li>&lt;li>Third&lt;/li>&lt;/ol>
&lt;ol type="A">&lt;li>First&lt;/li>&lt;li>Second&lt;/li>&lt;li>Third&lt;/li>&lt;/ol>
&lt;hr>
Unordered Lists:&lt;br>
&lt;ul>&lt;li>A&lt;/li>&lt;li>B&lt;/li>&lt;li>C&lt;/li>&lt;/ul>
&lt;ul type="circle">&lt;li>A&lt;/li>&lt;li>B&lt;/li>&lt;li>C&lt;/li>&lt;/ul>
&lt;ul type="square">&lt;li>A&lt;/li>&lt;li>B&lt;/li>&lt;li>C&lt;/li>&lt;/ul>
&lt;hr>
Definitions:&lt;br>
&lt;dl>
&lt;dt>Google:&lt;/dt>&lt;dd>The best thing since sliced bread&lt;/dd>
&lt;/dl>
&lt;hr>
Centered:&lt;br>&lt;center>
Time present and time past&lt;br>
Are both perhaps present in time future,&lt;br>
And time future contained in time past.&lt;br>
If all time is eternally present&lt;br>
All time is unredeemable.&lt;br>
&lt;/center>
&lt;hr>
Block Quote:
&lt;br>
&lt;blockquote>
We shall not cease from exploration&lt;br>
And the end of all our exploring&lt;br>
Will be to arrive where we started&lt;br>
And know the place for the first time.&lt;br>
&lt;i>-- T.S. Eliot&lt;/i>
&lt;/blockquote>
&lt;br>
&lt;hr>
Headings:&lt;br>
&lt;h1>Header 1&lt;/h1>
&lt;h2>Header 2&lt;/h2>
&lt;h3>Header 3&lt;/h3>
&lt;h3>Header 4&lt;/h4>
&lt;h3>Header 5&lt;/h5>
&lt;hr>
Images:&lt;br>
&lt;i>Remote image&lt;/i>&lt;br>
&lt;img src="//developers.google.com/kml/documentation/images/googleSample.png">&lt;br>
&lt;i>Scaled image&lt;/i>&lt;br>
&lt;img src="//developers.google.com/kml/documentation/images/googleSample.png" width=100>&lt;br>
&lt;hr>
Simple Tables:&lt;br>
&lt;table border="1" padding="1">
&lt;tr>&lt;td>1&lt;/td>&lt;td>2&lt;/td>&lt;td>3&lt;/td>&lt;td>4&lt;/td>&lt;td>5&lt;/td>&lt;/tr>
&lt;tr>&lt;td>a&lt;/td>&lt;td>b&lt;/td>&lt;td>c&lt;/td>&lt;td>d&lt;/td>&lt;td>e&lt;/td>&lt;/tr>
&lt;/table>
&lt;br>
[Did you notice that double-clicking on the placemark doesn't cause the viewer to take you anywhere? This is because it is possible to directly author a "placeless placemark". If you look at the code for this example, you will see that it has neither a point coordinate nor a LookAt element.]]]&gt;&lt;/description>
      &lt;/Placemark>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Ground Overlays&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Examples of ground overlays&lt;/description>
      &lt;GroundOverlay>
        &lt;name>Large-scale overlay on terrain&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Overlay shows Mount Etna erupting on July 13th, 2001.&lt;/description>
        &lt;LookAt>
          &lt;longitude>15.02468937557116&lt;/longitude>
          &lt;latitude>37.67395167941667&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-16.5581842842829&lt;/heading>
          &lt;tilt>58.31228652890705&lt;/tilt>
          &lt;range>30350.36838438907&lt;/range>
        &lt;/LookAt>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/etna.jpg&lt;/href>
        &lt;/Icon>
        &lt;LatLonBox>
          &lt;north>37.91904192681665&lt;/north>
          &lt;south>37.46543388598137&lt;/south>
          &lt;east>15.35832653742206&lt;/east>
          &lt;west>14.60128369746704&lt;/west>
          &lt;rotation>-0.1556640799496235&lt;/rotation>
        &lt;/LatLonBox>
      &lt;/GroundOverlay>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Screen Overlays&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Screen overlays have to be authored directly in KML. These
        examples illustrate absolute and dynamic positioning in screen space.&lt;/description>
      &lt;ScreenOverlay>
        &lt;name>Simple crosshairs&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>This screen overlay uses fractional positioning to put the
          image in the exact center of the screen&lt;/description>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/crosshairs.png&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0.5" y="0.5" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0.5" y="0.5" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0.5" y="0.5" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="pixels" yunits="pixels"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Top left&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/top_left.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Top right&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/top_right.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Bottom left&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/bottom_left.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0" y="-1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Absolute Positioning: Bottom right&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/bottom_right.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="1" y="-1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="1" y="0" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="0" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Dynamic Positioning: Top of screen&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/dynamic_screenoverlay.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="0" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="1" y="0.2" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
      &lt;ScreenOverlay>
        &lt;name>Dynamic Positioning: Right of screen&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;Icon>
          &lt;href>http://developers.google.com/kml/documentation/images/dynamic_right.jpg&lt;/href>
        &lt;/Icon>
        &lt;overlayXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;screenXY x="1" y="1" xunits="fraction" yunits="fraction"/>
        &lt;rotationXY x="0" y="0" xunits="fraction" yunits="fraction"/>
        &lt;size x="0" y="1" xunits="fraction" yunits="fraction"/>
      &lt;/ScreenOverlay>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Paths&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Examples of paths. Note that the tessellate tag is by default
        set to 0. If you want to create tessellated lines, they must be authored
        (or edited) directly in KML.&lt;/description>
      &lt;Placemark>
        &lt;name>Tessellated&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>&lt;![CDATA[If the &lt;tessellate> tag has a value of 1, the line will contour to the underlying terrain]]&gt;&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.0822680013139&lt;/longitude>
          &lt;latitude>36.09825589333556&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>103.8120432044965&lt;/heading>
          &lt;tilt>62.04855796276328&lt;/tilt>
          &lt;range>2889.145007690472&lt;/range>
        &lt;/LookAt>
        &lt;LineString>
          &lt;tessellate>1&lt;/tessellate>
          &lt;coordinates> -112.0814237830345,36.10677870477137,0
            -112.0870267752693,36.0905099328766,0 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Untessellated&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>&lt;![CDATA[If the &lt;tessellate> tag has a value of 0, the line follow a simple straight-line path from point to point]]&gt;&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.0822680013139&lt;/longitude>
          &lt;latitude>36.09825589333556&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>103.8120432044965&lt;/heading>
          &lt;tilt>62.04855796276328&lt;/tilt>
          &lt;range>2889.145007690472&lt;/range>
        &lt;/LookAt>
        &lt;LineString>
          &lt;tessellate>0&lt;/tessellate>
          &lt;coordinates> -112.080622229595,36.10673460007995,0
            -112.085242575315,36.09049598612422,0 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Absolute&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Transparent purple line&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2719329043177&lt;/longitude>
          &lt;latitude>36.08890633450894&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-106.8161545998597&lt;/heading>
          &lt;tilt>44.60763714063257&lt;/tilt>
          &lt;range>2569.386744398339&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#transPurpleLineGreenPoly&lt;/styleUrl>
        &lt;LineString>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>absolute&lt;/altitudeMode>
          &lt;coordinates> -112.265654928602,36.09447672602546,2357
            -112.2660384528238,36.09342608838671,2357
            -112.2668139013453,36.09251058776881,2357
            -112.2677826834445,36.09189827357996,2357
            -112.2688557510952,36.0913137941187,2357
            -112.2694810717219,36.0903677207521,2357
            -112.2695268555611,36.08932171487285,2357
            -112.2690144567276,36.08850916060472,2357
            -112.2681528815339,36.08753813597956,2357
            -112.2670588176031,36.08682685262568,2357
            -112.2657374587321,36.08646312301303,2357 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Absolute Extruded&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Transparent green wall with yellow outlines&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2643334742529&lt;/longitude>
          &lt;latitude>36.08563154742419&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-125.7518698668815&lt;/heading>
          &lt;tilt>44.61038665812578&lt;/tilt>
          &lt;range>4451.842204068102&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#yellowLineGreenPoly&lt;/styleUrl>
        &lt;LineString>
          &lt;extrude>1&lt;/extrude>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>absolute&lt;/altitudeMode>
          &lt;coordinates> -112.2550785337791,36.07954952145647,2357
            -112.2549277039738,36.08117083492122,2357
            -112.2552505069063,36.08260761307279,2357
            -112.2564540158376,36.08395660588506,2357
            -112.2580238976449,36.08511401044813,2357
            -112.2595218489022,36.08584355239394,2357
            -112.2608216347552,36.08612634548589,2357
            -112.262073428656,36.08626019085147,2357
            -112.2633204928495,36.08621519860091,2357
            -112.2644963846444,36.08627897945274,2357
            -112.2656969554589,36.08649599090644,2357 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Relative&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Black line (10 pixels wide), height tracks terrain&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2580438551384&lt;/longitude>
          &lt;latitude>36.1072674824385&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>4.947421249553717&lt;/heading>
          &lt;tilt>44.61324882043339&lt;/tilt>
          &lt;range>2927.61105910266&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#thickBlackLine&lt;/styleUrl>
        &lt;LineString>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates> -112.2532845153347,36.09886943729116,645
            -112.2540466121145,36.09919570465255,645
            -112.254734666947,36.09984998366178,645
            -112.255493345654,36.10051310621746,645
            -112.2563157098468,36.10108441943419,645
            -112.2568033076439,36.10159722088088,645
            -112.257494011321,36.10204323542867,645
            -112.2584106072308,36.10229131995655,645
            -112.2596588987972,36.10240001286358,645
            -112.2610581199487,36.10213176873407,645
            -112.2626285262793,36.10157011437219,645 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
      &lt;Placemark>
        &lt;name>Relative Extruded&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Opaque blue walls with red outline, height tracks terrain&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.2683594333433&lt;/longitude>
          &lt;latitude>36.09884362144909&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-72.24271551768405&lt;/heading>
          &lt;tilt>44.60855445139561&lt;/tilt>
          &lt;range>2184.193522571467&lt;/range>
        &lt;/LookAt>
        &lt;styleUrl>#redLineBluePoly&lt;/styleUrl>
        &lt;LineString>
          &lt;extrude>1&lt;/extrude>
          &lt;tessellate>1&lt;/tessellate>
          &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
          &lt;coordinates> -112.2656634181359,36.09445214722695,630
            -112.2652238941097,36.09520916122063,630
            -112.2645079986395,36.09580763864907,630
            -112.2638827428817,36.09628572284063,630
            -112.2635746835406,36.09679275951239,630
            -112.2635711822407,36.09740038871899,630
            -112.2640296531825,36.09804913435539,630
            -112.264327720538,36.09880337400301,630
            -112.2642436562271,36.09963644790288,630
            -112.2639148687042,36.10055381117246,630
            -112.2626894973474,36.10149062823369,630 &lt;/coordinates>
        &lt;/LineString>
      &lt;/Placemark>
    &lt;/Folder>
    &lt;Folder>
      &lt;name>Polygons&lt;/name>
      &lt;visibility>0&lt;/visibility>
      &lt;description>Examples of polygon shapes&lt;/description>
      &lt;Folder>
        &lt;name>Google Campus&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>A collection showing how easy it is to create 3-dimensional
          buildings&lt;/description>
        &lt;LookAt>
          &lt;longitude>-122.084120030116&lt;/longitude>
          &lt;latitude>37.42174011925477&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-34.82469740081282&lt;/heading>
          &lt;tilt>53.454348562403&lt;/tilt>
          &lt;range>276.7870053764046&lt;/range>
        &lt;/LookAt>
        &lt;Placemark>
          &lt;name>Building 40&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transRedPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0848938459612,37.42257124044786,17
                  -122.0849580979198,37.42211922626856,17
                  -122.0847469573047,37.42207183952619,17
                  -122.0845725380962,37.42209006729676,17
                  -122.0845954886723,37.42215932700895,17
                  -122.0838521118269,37.42227278564371,17
                  -122.083792243335,37.42203539112084,17
                  -122.0835076656616,37.42209006957106,17
                  -122.0834709464152,37.42200987395161,17
                  -122.0831221085748,37.4221046494946,17
                  -122.0829247374572,37.42226503990386,17
                  -122.0829339169385,37.42231242843094,17
                  -122.0833837359737,37.42225046087618,17
                  -122.0833607854248,37.42234159228745,17
                  -122.0834204551642,37.42237075460644,17
                  -122.083659133885,37.42251292011001,17
                  -122.0839758438952,37.42265873093781,17
                  -122.0842374743331,37.42265143972521,17
                  -122.0845036949503,37.4226514386435,17
                  -122.0848020460801,37.42261133916315,17
                  -122.0847882750515,37.42256395055121,17
                  -122.0848938459612,37.42257124044786,17 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Building 41&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transBluePoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0857412771483,37.42227033155257,17
                  -122.0858169768481,37.42231408832346,17
                  -122.085852582875,37.42230337469744,17
                  -122.0858799945639,37.42225686138789,17
                  -122.0858860101409,37.4222311076138,17
                  -122.0858069157288,37.42220250173855,17
                  -122.0858379542653,37.42214027058678,17
                  -122.0856732640519,37.42208690214408,17
                  -122.0856022926407,37.42214885429042,17
                  -122.0855902778436,37.422128290487,17
                  -122.0855841672237,37.42208171967246,17
                  -122.0854852065741,37.42210455874995,17
                  -122.0855067264352,37.42214267949824,17
                  -122.0854430712915,37.42212783846172,17
                  -122.0850990714904,37.42251282407603,17
                  -122.0856769818632,37.42281815323651,17
                  -122.0860162273783,37.42244918858722,17
                  -122.0857260327004,37.42229239604253,17
                  -122.0857412771483,37.42227033155257,17 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Building 42&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transGreenPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0857862287242,37.42136208886969,25
                  -122.0857312990603,37.42136935989481,25
                  -122.0857312992918,37.42140934910903,25
                  -122.0856077073679,37.42138390166565,25
                  -122.0855802426516,37.42137299550869,25
                  -122.0852186221971,37.42137299504316,25
                  -122.0852277765639,37.42161656508265,25
                  -122.0852598189347,37.42160565894403,25
                  -122.0852598185499,37.42168200156,25
                  -122.0852369311478,37.42170017860346,25
                  -122.0852643957828,37.42176197982575,25
                  -122.0853239032746,37.42176198013907,25
                  -122.0853559454324,37.421852864452,25
                  -122.0854108752463,37.42188921823734,25
                  -122.0854795379357,37.42189285337048,25
                  -122.0855436229819,37.42188921797546,25
                  -122.0856260178042,37.42186013499926,25
                  -122.085937287963,37.42186013453605,25
                  -122.0859428718666,37.42160898590042,25
                  -122.0859655469861,37.42157992759144,25
                  -122.0858640462341,37.42147115002957,25
                  -122.0858548911215,37.42140571326184,25
                  -122.0858091162768,37.4214057134039,25
                  -122.0857862287242,37.42136208886969,25 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Building 43&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transYellowPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -122.0844371128284,37.42177253003091,19
                  -122.0845118855746,37.42191111542896,19
                  -122.0850470999805,37.42178755121535,19
                  -122.0850719913391,37.42143663023161,19
                  -122.084916406232,37.42137237822116,19
                  -122.0842193868167,37.42137237801626,19
                  -122.08421938659,37.42147617161496,19
                  -122.0838086419991,37.4214613409357,19
                  -122.0837899728564,37.42131306410796,19
                  -122.0832796534698,37.42129328840593,19
                  -122.0832609819207,37.42139213944298,19
                  -122.0829373621737,37.42137236399876,19
                  -122.0829062425667,37.42151569778871,19
                  -122.0828502269665,37.42176282576465,19
                  -122.0829435788635,37.42176776969635,19
                  -122.083217411188,37.42179248552686,19
                  -122.0835970430103,37.4217480074456,19
                  -122.0839455556771,37.42169364237603,19
                  -122.0840077894637,37.42176283815853,19
                  -122.084113587521,37.42174801104392,19
                  -122.0840762473784,37.42171341292375,19
                  -122.0841447047739,37.42167881534569,19
                  -122.084144704223,37.42181720660197,19
                  -122.0842503333074,37.4218170700446,19
                  -122.0844371128284,37.42177253003091,19 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
      &lt;/Folder>
      &lt;Folder>
        &lt;name>Extruded Polygon&lt;/name>
        &lt;description>A simple way to model a building&lt;/description>
        &lt;Placemark>
          &lt;name>The Pentagon&lt;/name>
          &lt;LookAt>
            &lt;longitude>-77.05580139178142&lt;/longitude>
            &lt;latitude>38.870832443487&lt;/latitude>
            &lt;heading>59.88865561738225&lt;/heading>
            &lt;tilt>48.09646074797388&lt;/tilt>
            &lt;range>742.0552506670548&lt;/range>
          &lt;/LookAt>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -77.05788457660967,38.87253259892824,100
                  -77.05465973756702,38.87291016281703,100
                  -77.05315536854791,38.87053267794386,100
                  -77.05552622493516,38.868757801256,100
                  -77.05844056290393,38.86996206506943,100
                  -77.05788457660967,38.87253259892824,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
            &lt;innerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -77.05668055019126,38.87154239798456,100
                  -77.05542625960818,38.87167890344077,100
                  -77.05485125901024,38.87076535397792,100
                  -77.05577677433152,38.87008686581446,100
                  -77.05691162017543,38.87054446963351,100
                  -77.05668055019126,38.87154239798456,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/innerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
      &lt;/Folder>
      &lt;Folder>
        &lt;name>Absolute and Relative&lt;/name>
        &lt;visibility>0&lt;/visibility>
        &lt;description>Four structures whose roofs meet exactly. Turn on/off
          terrain to see the difference between relative and absolute
          positioning.&lt;/description>
        &lt;LookAt>
          &lt;longitude>-112.3348969157552&lt;/longitude>
          &lt;latitude>36.14845533214919&lt;/latitude>
          &lt;altitude>0&lt;/altitude>
          &lt;heading>-86.91235037566909&lt;/heading>
          &lt;tilt>49.30695423894192&lt;/tilt>
          &lt;range>990.6761201087104&lt;/range>
        &lt;/LookAt>
        &lt;Placemark>
          &lt;name>Absolute&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transBluePoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>absolute&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3372510731295,36.14888505105317,1784
                  -112.3356128688403,36.14781540589019,1784
                  -112.3368169371048,36.14658677734382,1784
                  -112.3384408457543,36.14762778914076,1784
                  -112.3372510731295,36.14888505105317,1784 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Absolute Extruded&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;styleUrl>#transRedPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>absolute&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3396586818843,36.14637618647505,1784
                  -112.3380597654315,36.14531751871353,1784
                  -112.3368254237788,36.14659596244607,1784
                  -112.3384555043203,36.14762621763982,1784
                  -112.3396586818843,36.14637618647505,1784 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Relative&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;LookAt>
            &lt;longitude>-112.3350152490417&lt;/longitude>
            &lt;latitude>36.14943123077423&lt;/latitude>
            &lt;altitude>0&lt;/altitude>
            &lt;heading>-118.9214100848499&lt;/heading>
            &lt;tilt>37.92486261093203&lt;/tilt>
            &lt;range>345.5169113679813&lt;/range>
          &lt;/LookAt>
          &lt;styleUrl>#transGreenPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3349463145932,36.14988705767721,100
                  -112.3354019540677,36.14941108398372,100
                  -112.3344428289146,36.14878490381308,100
                  -112.3331289492913,36.14780840132443,100
                  -112.3317019516947,36.14680755678357,100
                  -112.331131440106,36.1474173426228,100
                  -112.332616324338,36.14845453364654,100
                  -112.3339876620524,36.14926570522069,100
                  -112.3349463145932,36.14988705767721,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
        &lt;Placemark>
          &lt;name>Relative Extruded&lt;/name>
          &lt;visibility>0&lt;/visibility>
          &lt;LookAt>
            &lt;longitude>-112.3351587892382&lt;/longitude>
            &lt;latitude>36.14979247129029&lt;/latitude>
            &lt;altitude>0&lt;/altitude>
            &lt;heading>-55.42811560891606&lt;/heading>
            &lt;tilt>56.10280503739589&lt;/tilt>
            &lt;range>401.0997279712519&lt;/range>
          &lt;/LookAt>
          &lt;styleUrl>#transYellowPoly&lt;/styleUrl>
          &lt;Polygon>
            &lt;extrude>1&lt;/extrude>
            &lt;tessellate>1&lt;/tessellate>
            &lt;altitudeMode>relativeToGround&lt;/altitudeMode>
            &lt;outerBoundaryIs>
              &lt;LinearRing>
                &lt;coordinates> -112.3348783983763,36.1514008468736,100
                  -112.3372535345629,36.14888517553886,100
                  -112.3356068927954,36.14781612679284,100
                  -112.3350034807972,36.14846469024177,100
                  -112.3358353861232,36.1489624162954,100
                  -112.3345888301373,36.15026229372507,100
                  -112.3337937856278,36.14978096026463,100
                  -112.3331798208424,36.1504472788618,100
                  -112.3348783983763,36.1514008468736,100 &lt;/coordinates>
              &lt;/LinearRing>
            &lt;/outerBoundaryIs>
          &lt;/Polygon>
        &lt;/Placemark>
      &lt;/Folder>
    &lt;/Folder>
  &lt;/Document>
&lt;/kml>
</pre>



<p>The following is the resulting CSV after running the above code snippet (new CSV file: <code>'my_file.csv'</code>):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">36.10677870477137,-112.0814237830345,36.0905099328766,-112.0870267752693
36.10673460007995,-112.080622229595,36.09049598612422,-112.085242575315
36.09447672602546,-112.265654928602,36.09342608838671,-112.2660384528238,36.09251058776881,-112.2668139013453,36.09189827357996,-112.2677826834445,36.0913137941187,-112.2688557510952,36.0903677207521,-112.2694810717219,36.08932171487285,-112.2695268555611,36.08850916060472,-112.2690144567276,36.08753813597956,-112.2681528815339,36.08682685262568,-112.2670588176031,36.08646312301303,-112.2657374587321
36.07954952145647,-112.2550785337791,36.08117083492122,-112.2549277039738,36.08260761307279,-112.2552505069063,36.08395660588506,-112.2564540158376,36.08511401044813,-112.2580238976449,36.08584355239394,-112.2595218489022,36.08612634548589,-112.2608216347552,36.08626019085147,-112.262073428656,36.08621519860091,-112.2633204928495,36.08627897945274,-112.2644963846444,36.08649599090644,-112.2656969554589
36.09886943729116,-112.2532845153347,36.09919570465255,-112.2540466121145,36.09984998366178,-112.254734666947,36.10051310621746,-112.255493345654,36.10108441943419,-112.2563157098468,36.10159722088088,-112.2568033076439,36.10204323542867,-112.257494011321,36.10229131995655,-112.2584106072308,36.10240001286358,-112.2596588987972,36.10213176873407,-112.2610581199487,36.10157011437219,-112.2626285262793
36.09445214722695,-112.2656634181359,36.09520916122063,-112.2652238941097,36.09580763864907,-112.2645079986395,36.09628572284063,-112.2638827428817,36.09679275951239,-112.2635746835406,36.09740038871899,-112.2635711822407,36.09804913435539,-112.2640296531825,36.09880337400301,-112.264327720538,36.09963644790288,-112.2642436562271,36.10055381117246,-112.2639148687042,36.10149062823369,-112.2626894973474
37.42257124044786,-122.0848938459612,37.42211922626856,-122.0849580979198,37.42207183952619,-122.0847469573047,37.42209006729676,-122.0845725380962,37.42215932700895,-122.0845954886723,37.42227278564371,-122.0838521118269,37.42203539112084,-122.083792243335,37.42209006957106,-122.0835076656616,37.42200987395161,-122.0834709464152,37.4221046494946,-122.0831221085748,37.42226503990386,-122.0829247374572,37.42231242843094,-122.0829339169385,37.42225046087618,-122.0833837359737,37.42234159228745,-122.0833607854248,37.42237075460644,-122.0834204551642,37.42251292011001,-122.083659133885,37.42265873093781,-122.0839758438952,37.42265143972521,-122.0842374743331,37.4226514386435,-122.0845036949503,37.42261133916315,-122.0848020460801,37.42256395055121,-122.0847882750515,37.42257124044786,-122.0848938459612
37.42227033155257,-122.0857412771483,37.42231408832346,-122.0858169768481,37.42230337469744,-122.085852582875,37.42225686138789,-122.0858799945639,37.4222311076138,-122.0858860101409,37.42220250173855,-122.0858069157288,37.42214027058678,-122.0858379542653,37.42208690214408,-122.0856732640519,37.42214885429042,-122.0856022926407,37.422128290487,-122.0855902778436,37.42208171967246,-122.0855841672237,37.42210455874995,-122.0854852065741,37.42214267949824,-122.0855067264352,37.42212783846172,-122.0854430712915,37.42251282407603,-122.0850990714904,37.42281815323651,-122.0856769818632,37.42244918858722,-122.0860162273783,37.42229239604253,-122.0857260327004,37.42227033155257,-122.0857412771483
37.42136208886969,-122.0857862287242,37.42136935989481,-122.0857312990603,37.42140934910903,-122.0857312992918,37.42138390166565,-122.0856077073679,37.42137299550869,-122.0855802426516,37.42137299504316,-122.0852186221971,37.42161656508265,-122.0852277765639,37.42160565894403,-122.0852598189347,37.42168200156,-122.0852598185499,37.42170017860346,-122.0852369311478,37.42176197982575,-122.0852643957828,37.42176198013907,-122.0853239032746,37.421852864452,-122.0853559454324,37.42188921823734,-122.0854108752463,37.42189285337048,-122.0854795379357,37.42188921797546,-122.0855436229819,37.42186013499926,-122.0856260178042,37.42186013453605,-122.085937287963,37.42160898590042,-122.0859428718666,37.42157992759144,-122.0859655469861,37.42147115002957,-122.0858640462341,37.42140571326184,-122.0858548911215,37.4214057134039,-122.0858091162768,37.42136208886969,-122.0857862287242
37.42177253003091,-122.0844371128284,37.42191111542896,-122.0845118855746,37.42178755121535,-122.0850470999805,37.42143663023161,-122.0850719913391,37.42137237822116,-122.084916406232,37.42137237801626,-122.0842193868167,37.42147617161496,-122.08421938659,37.4214613409357,-122.0838086419991,37.42131306410796,-122.0837899728564,37.42129328840593,-122.0832796534698,37.42139213944298,-122.0832609819207,37.42137236399876,-122.0829373621737,37.42151569778871,-122.0829062425667,37.42176282576465,-122.0828502269665,37.42176776969635,-122.0829435788635,37.42179248552686,-122.083217411188,37.4217480074456,-122.0835970430103,37.42169364237603,-122.0839455556771,37.42176283815853,-122.0840077894637,37.42174801104392,-122.084113587521,37.42171341292375,-122.0840762473784,37.42167881534569,-122.0841447047739,37.42181720660197,-122.084144704223,37.4218170700446,-122.0842503333074,37.42177253003091,-122.0844371128284
38.87253259892824,-77.05788457660967,38.87291016281703,-77.05465973756702,38.87053267794386,-77.05315536854791,38.868757801256,-77.05552622493516,38.86996206506943,-77.05844056290393,38.87253259892824,-77.05788457660967
38.87154239798456,-77.05668055019126,38.87167890344077,-77.05542625960818,38.87076535397792,-77.05485125901024,38.87008686581446,-77.05577677433152,38.87054446963351,-77.05691162017543,38.87154239798456,-77.05668055019126
36.14888505105317,-112.3372510731295,36.14781540589019,-112.3356128688403,36.14658677734382,-112.3368169371048,36.14762778914076,-112.3384408457543,36.14888505105317,-112.3372510731295
36.14637618647505,-112.3396586818843,36.14531751871353,-112.3380597654315,36.14659596244607,-112.3368254237788,36.14762621763982,-112.3384555043203,36.14637618647505,-112.3396586818843
36.14988705767721,-112.3349463145932,36.14941108398372,-112.3354019540677,36.14878490381308,-112.3344428289146,36.14780840132443,-112.3331289492913,36.14680755678357,-112.3317019516947,36.1474173426228,-112.331131440106,36.14845453364654,-112.332616324338,36.14926570522069,-112.3339876620524,36.14988705767721,-112.3349463145932
36.1514008468736,-112.3348783983763,36.14888517553886,-112.3372535345629,36.14781612679284,-112.3356068927954,36.14846469024177,-112.3350034807972,36.1489624162954,-112.3358353861232,36.15026229372507,-112.3345888301373,36.14978096026463,-112.3337937856278,36.1504472788618,-112.3331798208424,36.1514008468736,-112.3348783983763
</pre>



<h2 class="wp-block-heading">How to Convert KMZ to CSV in Python?</h2>



<p>Files in the KML format are often packaged and distributed as KMZ files with the suffix <code>.kmz</code>. </p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2139.png" alt="ℹ" class="wp-smiley" style="height: 1em; max-height: 1em;" /> KMZ files are zipped KML files with a special content structure: a single root KML document named <code>doc.kml</code>. Any additional files, such as images, icons, and 3D models, are also stored in the zip archive.</p>



<p class="has-global-color-8-background-color has-background">To convert a KMZ file to a CSV, you can unzip it and convert the root KML file to a <code>.csv</code> file in Python by using the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/" data-type="post" data-id="474965" target="_blank">BeautifulSoup</a> and the <code>csv</code> libraries. You use the former to read the XML-structured KML file and the latter to write the CSV file row by row. </p>



<p>The remaining (non-KML) contents of the zip archive, such as images, cannot meaningfully be converted to a CSV anyway.</p>



<p>See the code above for the KML to CSV conversion.</p>
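<p>The whole KMZ workflow can be condensed into a small helper. This is a minimal sketch, not the article&#8217;s original code: the function name <code>kmz_to_csv</code> and the file names are hypothetical placeholders, and it assumes the root document inside the KMZ is named <code>doc.kml</code>, as is conventional.</p>

```python
# Sketch: unzip a KMZ, parse the root doc.kml with BeautifulSoup,
# and write one CSV row of latitude,longitude pairs per <coordinates> tag.
# The function name kmz_to_csv and all file names are placeholders.
import csv
import zipfile

from bs4 import BeautifulSoup


def kmz_to_csv(kmz_path, csv_path):
    with zipfile.ZipFile(kmz_path) as kmz:
        # By convention, the root KML document inside a KMZ is 'doc.kml'
        kml_data = kmz.read('doc.kml')

    soup = BeautifulSoup(kml_data, 'xml')

    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for coords in soup.find_all('coordinates'):
            row = []
            # Each whitespace-separated entry is 'longitude,latitude,altitude';
            # swap to latitude,longitude order for the CSV
            for triple in coords.text.split():
                lon, lat, _alt = triple.split(',')
                row.extend([lat, lon])
            writer.writerow(row)
```

<p>Calling, say, <code>kmz_to_csv('my_places.kmz', 'my_file.csv')</code> would then produce the same kind of CSV shown above.</p>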



<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p>The post <a href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/">Python &#8211; How to Convert KML to CSV?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV</title>
		<link>https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/</link>
		
		<dc:creator><![CDATA[Jordan Marshall]]></dc:creator>
		<pubDate>Sat, 16 Jul 2022 15:04:30 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[XML]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=474965</guid>

					<description><![CDATA[<p>Though Python’s BeautifulSoup module was designed to scrape HTML files, it can also be used to parse XML files. In today’s professional marketplace, it is useful to be able to change an XML file into other formats, specifically dictionaries, CSV, JSON, and dataframes according to specific needs. In this article, we will discuss that process. ... <a title="Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV" class="read-more" href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/" aria-label="Read more about Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/">Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><em>Though Python’s BeautifulSoup module was designed to scrape HTML files, it can also be used to parse XML files. </em></p>



<p><em>In today’s professional marketplace, it is useful to be able to change an XML file into other formats, specifically dictionaries, CSV, JSON, and dataframes according to specific needs. </em></p>



<p><em>In this article, we will discuss that process.</em></p>



<h2 class="wp-block-heading">Scraping XML with BeautifulSoup</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Extensible Markup Language</strong> or <strong><a rel="noreferrer noopener" href="https://blog.finxter.com/xml-developer-income-and-opportunity/" data-type="post" data-id="242005" target="_blank">XML</a></strong> differs from <a rel="noreferrer noopener" href="https://blog.finxter.com/html-developer-income-and-opportunity/" data-type="post" data-id="191232" target="_blank">HTML</a> in that HTML primarily deals with how information is displayed on a webpage, and XML handles how data is stored and transmitted. XML also uses custom tags and is designed to be user and machine-readable. </p>



<p>When inspecting a webpage, a statement at the top of the page will denote what type of file you are viewing. </p>



<p>For an XML file, you may see <code>&lt;?xml version="1.0"?&gt;</code>. </p>



<p class="has-base-background-color has-background">As a side note, &#8220;<code>version 1.0</code>&#8221; is a little deceiving: several modifications have been made since its inception in 1998; the name has just not changed. </p>



<p>Despite the differences between HTML and XML, because <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank">BeautifulSoup</a> creates a <strong>Python object tree</strong>, it can be used to parse both, and the process is similar for each. For this article, I will be using a sample XML file from <a rel="noreferrer noopener" href="https://www.w3schools.com/xml/cd_catalog.xml" data-type="URL" data-id="https://www.w3schools.com/xml/cd_catalog.xml" target="_blank">w3schools.com</a>.</p>



<p>Import the BeautifulSoup library and requests modules to scrape this file.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Import needed libraries
from pprint import pprint
from bs4 import BeautifulSoup
import requests</pre>



<p>Once these have been imported, request the content of the webpage.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Request data
webpage = requests.get("https://www.w3schools.com/xml/cd_catalog.xml")
data = webpage.content
pprint(data)</pre>



<p>At this point, I like to print the data just to make sure I am getting what I need. I use the <code>pprint()</code> function to make the output more readable.</p>



<p>Next, create a BeautifulSoup object and declare the parser to be used. Because it is an XML file, use an XML parser.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Create a BeautifulSoup object
soup = BeautifulSoup(data, 'xml')
print(soup.prettify())</pre>



<p>With that printed, you can see the object tree created by BeautifulSoup. The parent, “<code>&lt;CATALOG&gt;</code>”, its child “<code>&lt;CD&gt;</code>”, and all of the children of “<code>CD</code>” are displayed.</p>



<p><strong>Output of the first CD:</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;CATALOG>
&lt;CD>
&lt;TITLE>Empire Burlesque&lt;/TITLE>
&lt;ARTIST>Bob Dylan&lt;/ARTIST>
&lt;COUNTRY>USA&lt;/COUNTRY>
&lt;COMPANY>Columbia&lt;/COMPANY>
&lt;PRICE>10.90&lt;/PRICE>
&lt;YEAR>1985&lt;/YEAR>
&lt;/CD></pre>



<p>All that is left is to scrape the desired data and display it. </p>



<p>Using the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-enumerate/" data-type="post" data-id="20466" target="_blank"><code>enumerate()</code></a> and <code><a rel="noreferrer noopener" href="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" data-type="post" data-id="17772" target="_blank">find_all()</a></code> functions, each occurrence of a tag can be found, and its contents can be placed into a <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>. </p>



<p>After that, using a <code>for</code> loop, <a rel="noreferrer noopener" href="https://blog.finxter.com/python-unpacking/" data-type="post" data-id="396420" target="_blank">unpack</a> the created lists and create groupings. The <code>.text</code> attribute and the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-string-strip/" data-type="post" data-id="26104" target="_blank">strip()</a></code> function give only the text and <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-remove-extra-whitespaces-in-beautifulsoup/" data-type="post" data-id="223870" target="_blank">remove the white space</a>. </p>



<p>Just for readability, <a href="https://blog.finxter.com/how-to-skip-a-line-in-python-using-n/" data-type="post" data-id="451007" target="_blank" rel="noreferrer noopener">print a blank line</a> after each grouping.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Scrape data
parent = soup.find('CATALOG')
for n, tag in enumerate(parent.find_all('CD')):
    title = [x for x in tag.find_all('TITLE')]
    artist = [x for x in tag.find_all('ARTIST')]
    country = [x for x in tag.find_all('COUNTRY')]
    company = [x for x in tag.find_all('COMPANY')]
    price = [x for x in tag.find_all('PRICE')]
    year = [x for x in tag.find_all('YEAR')]
    # view data
    for item in title:
        print('Title: ', item.text.strip())
    for item in artist:
        print('Artist: ', item.text.strip())
    for item in country:
        print('Country: ', item.text.strip())
    for item in company:
        print('Company: ', item.text.strip())
    for item in price:
        print('Price: ', item.text.strip())
    for item in year:
        print('Year: ', item.text.strip())
    print()</pre>



<p>With that, the CDs should be cataloged in this format.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Title:  Empire Burlesque
Artist:  Bob Dylan
Country:  USA
Company:  Columbia
Price:  10.90
Year:  1985 </pre>



<h2 class="wp-block-heading">XML to Dictionary</h2>



<p>Besides lists, <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionaries</a> are a common structure for storing data in Python. </p>



<p>Information is stored in key: value pairs. Those pairs are stored within curly <code>{}</code> brackets. </p>



<p class="has-base-background-color has-background"><strong>Example</strong>: <code>capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}</code></p>



<p>The key of the pair is case-sensitive and unique. The value can be any data type and may be duplicated. </p>



<p>Accessing the value of a pair is done via its key. Since keys cannot be duplicated, finding a value even in a large dictionary is easy as long as you know the key. A list of keys can be obtained using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-dict-keys-method/" data-type="post" data-id="37711" target="_blank">keys()</a></code> method. </p>



<p class="has-base-background-color has-background"><strong>Example</strong>: <code>print(capital.keys())</code></p>



<p>Finding information in a dictionary is quick since you only search for a specific key. </p>
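<p>As a quick, self-contained illustration of these operations, using the <code>capital</code> dictionary from above:</p>

```python
# Create the example dictionary of state capitals
capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}

# Look up a value via its key (fast even for large dictionaries)
print(capital['Pennsylvania'])   # Harrisburg

# Obtain a list of all keys (insertion order is preserved in Python 3.7+)
print(list(capital.keys()))      # ['Pennsylvania', 'Michigan']
```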



<p>Because of this quick access, dictionaries are used quite often when memory usage is not a concern. It is therefore important to know how to convert information from an XML file into a dictionary. </p>



<p class="has-global-color-8-background-color has-background">There are six basic steps to convert an XML to a dictionary:</p>



<ol class="has-global-color-8-background-color has-background wp-block-list"><li><code>import xmltodict</code></li><li><code>import pprint</code></li><li><code>with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:</code><ul><li><code>cd_xml = file.read()</code></li></ul></li><li><code>cd_dict = xmltodict.parse(cd_xml)</code></li><li><code>cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]</code></li><li><code>pprint.pprint(cd_dict_list)</code></li></ol>



<p>First, for the conversion, use the third-party <code><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-xmltodict-in-python/" data-type="post" data-id="457149" target="_blank">xmltodict</a></code> library (install it with <code>pip install xmltodict</code> if needed). Import that module and any other modules to be used.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import xmltodict
import pprint</pre>



<p>Second, the file needs to be opened, read, and assigned to a variable.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:
    cd_xml = file.read()</pre>



<p>Third, use <code>xmltodict.parse()</code> to convert the XML string to a dictionary and view it.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cd_dict = xmltodict.parse(cd_xml)
cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]
pprint.pprint(cd_dict_list)</pre>



<p>The output of this is a nice clean <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-create-a-list-of-dictionaries-in-python/" data-type="post" data-id="10576" target="_blank">list of dictionaries</a>. To view all artists, a simple <code>for</code> loop can be used.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for item in cd_dict_list:
    print(item['ARTIST'])</pre>



<h2 class="wp-block-heading">XML to JSON</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>JSON</strong> stands for <strong>JavaScript Object Notation</strong>. These files store data in <code>key:value</code> form like a Python dictionary. JSON files are used primarily to transmit data between web applications and servers. </p>



<p>Converting an XML file to a JSON file requires only a few lines of code.&nbsp;</p>



<p>As always, import the needed libraries and modules.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json
from pprint import pprint
import xmltodict</pre>



<p>Again, you will see the use of <code>xmltodict</code>. Because of their similarities, first convert the file to a dictionary, then write it to a JSON file. The <code>json.dumps()</code> function serializes the parsed dictionary into a JSON string, which is then written to a JSON file.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog example.xml') as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

json_data = json.dumps(data_dict)
with open('data.json', 'w') as json_file:
    json_file.write(json_data)</pre>



<p><strong>Output</strong>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">('{"CATALOG": {"CD": [{"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan", '
 '"COUNTRY": "USA", "COMPANY": "Columbia", "PRICE": "10.90", "YEAR": "1985"}, '
 '{"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler", "COUNTRY": "UK", '
 '"COMPANY": "CBS Records", "PRICE": "9.90", "YEAR": "1988"}, {"TITLE": '
 '"Greatest Hits", "ARTIST": "Dolly Parton", "COUNTRY": "USA", "COMPANY": '
 '"RCA", "PRICE": "9.90", "YEAR": "1982"}, {"TITLE": "Still got the blues", '….)
</pre>



<p>The data that started as an XML file has now been written to a JSON file called <code>data.json</code>.&nbsp;</p>
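

<p>The same write-then-read round trip can be sketched with only the standard library. Here a small dictionary stands in for the <code>xmltodict</code> output, and the file is written to a temporary directory for illustration:</p>

```python
import json
import os
import tempfile

# A small dictionary standing in for the parsed XML catalog.
data_dict = {'CATALOG': {'CD': [{'TITLE': 'Empire Burlesque',
                                 'ARTIST': 'Bob Dylan'}]}}

# Write the dictionary out as JSON...
path = os.path.join(tempfile.gettempdir(), 'data.json')
with open(path, 'w') as json_file:
    json.dump(data_dict, json_file)

# ...then load it back in to confirm nothing was lost.
with open(path) as json_file:
    loaded = json.load(json_file)

print(loaded == data_dict)  # True
```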



<h2 class="wp-block-heading">XML to DataFrame</h2>



<p>There are a couple of ways to achieve this goal. </p>



<p>Python&#8217;s built-in <code>ElementTree</code> module is one. I am, however, partial to <a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank">Pandas</a>. </p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Pandas</strong> is a great module for working with data, and it simplifies many daily tasks of a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-developer-income-and-opportunity/" data-type="post" data-id="189354" target="_blank">programmer</a> and <a rel="noreferrer noopener" href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478" target="_blank">data scientist</a>. I strongly suggest <a href="https://blog.finxter.com/pandas-cheat-sheets/" data-type="post" data-id="7977" target="_blank" rel="noreferrer noopener">becoming familiar</a> with this module. </p>



<p>For this code, use a combination of BeautifulSoup and Pandas.</p>



<p>Import the necessary libraries.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
from bs4 import BeautifulSoup</pre>



<p>To display the output fully, Pandas&#8217; display settings may need to be altered. I am going to set the maximum number of columns as well as the display width. This will override any default settings that may be in place. </p>



<p>Without doing this, you may find some of your columns are replaced by ‘<code>…</code>’ or the columns may be displayed under your first couple of columns.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># set max columns and display width
pd.set_option("display.max_columns", 10)
pd.set_option("display.width", 1000)</pre>



<p>The width and columns can be changed according to your needs. With that completed, <a href="https://blog.finxter.com/python-open-function/" data-type="post" data-id="24793" target="_blank" rel="noreferrer noopener">open</a> and read the XML file. Store the contents in a variable.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r') as xml_file:
    contents = xml_file.read()</pre>



<p>Next, create a BeautifulSoup object.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># BeautifulSoup object
soup = BeautifulSoup(contents, 'xml')</pre>



<p>The next step is to extract the data and assign it to a variable.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Extract data and assign it to a variable
title = soup.find_all("TITLE")
artist = soup.find_all("ARTIST")
country = soup.find_all("COUNTRY")
company = soup.find_all("COMPANY")
price = soup.find_all("PRICE")
year = soup.find_all("YEAR")</pre>



<p>Now a <code>for</code> loop can be used to extract the text. </p>



<p>Using the length of one of the variables means you never need to remember how many items are cataloged, even if data is added or removed later. </p>



<p>Place the text in an <a href="https://blog.finxter.com/how-to-create-an-empty-list-in-python/" data-type="post" data-id="453870" target="_blank" rel="noreferrer noopener">empty list</a>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Text
cd_info = []
for i in range(len(title)):
    rows = [title[i].get_text(),
            artist[i].get_text(),
            country[i].get_text(),
            company[i].get_text(),
            price[i].get_text(),
            year[i].get_text()]
    cd_info.append(rows)</pre>



<p>Lastly, create the data frame and name the columns.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Create a dataframe with Pandas and print
df = pd.DataFrame(cd_info, columns=['Title', 'Artist', 'Country', 'Company', 'Price', 'Year'])
print(df)</pre>



<p><strong>Output</strong></p>



<pre class="wp-block-preformatted"><code>            Title                  Artist              Country         Company      Price     Year
0           Empire Burlesque       Bob Dylan           USA             Columbia     10.90     1985
1           Hide your heart        Bonnie Tyler        UK              CBS Records  9.90      1988
2           Greatest Hits          Dolly Parton        USA             RCA          9.90      1982</code></pre>



<p>A nice, neat table containing each CD’s data has been created.</p>



<h2 class="wp-block-heading">XML to CSV</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> A CSV file, or comma-separated values file, stores tabular data as plain text easily readable by the user. It is commonly used to exchange data between applications and can be opened by any editor. </p>



<p>For example, Microsoft Excel. Each line represents a new row of data, and each comma starts a new column. Using the code from above, the XML data can be converted to a CSV file with one new line.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df.to_csv('cd catalog.csv')</pre>



<p>With that, open your file manager and navigate to the script&#8217;s working directory to find <code>'cd catalog.csv'</code>. It will open in the default program used for spreadsheets, in this case Microsoft Excel.</p>
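

<p>As a sketch of the same round trip without touching the disk: when <code>to_csv()</code> is given no path, it returns the CSV text as a string, which <code>read_csv()</code> can parse straight back. The two sample rows are taken from the catalog above:</p>

```python
import io

import pandas as pd

df = pd.DataFrame([['Empire Burlesque', 'Bob Dylan', 'USA'],
                   ['Hide your heart', 'Bonnie Tyler', 'UK']],
                  columns=['Title', 'Artist', 'Country'])

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv() accepts any file-like object, so the string round-trips cleanly.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back.equals(df))  # True
```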



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><td>Title</td><td>Artist&nbsp;</td><td>Country</td><td>Company</td><td>Price</td><td>&nbsp;Year</td></tr><tr><td>Empire Burlesque</td><td>Bob Dylan</td><td>USA</td><td>Columbia</td><td>10.90</td><td>1985</td></tr><tr><td>Hide your heart</td><td>Bonnie Tyler</td><td>UK</td><td>CBS Records</td><td>9.90</td><td>1988</td></tr><tr><td>Greatest Hits</td><td>Dolly Parton</td><td>USA</td><td>RCA</td><td>9.90</td><td>1982</td></tr><tr><td>Still got the blues</td><td>Gary Moore</td><td>UK</td><td>Virgin records</td><td>10.20</td><td>1990</td></tr><tr><td>Eros</td><td>Eros Ramazzotti</td><td>EU</td><td>BMG</td><td>9.90</td><td>1997</td></tr><tr><td>One night only</td><td>Bee Gees</td><td>UK</td><td>Polydor</td><td>10.90</td><td>1998</td></tr><tr><td>Sylvias Mother</td><td>Dr.Hook</td><td>UK</td><td>CBS</td><td>8.10</td><td>1973</td></tr><tr><td>Maggie May</td><td>Rod Stewart</td><td>UK</td><td>Pickwick</td><td>8.50</td><td>1990</td></tr><tr><td>Romanza</td><td>Andrea Bocelli</td><td>EU</td><td>Polydor</td><td>10.80</td><td>1996</td></tr></tbody></table></figure>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Related Tutorial</strong>: <a rel="noreferrer noopener" href="https://blog.finxter.com/python-how-to-convert-kml-to-csv/" data-type="URL" data-id="https://blog.finxter.com/python-how-to-convert-kml-to-csv/" target="_blank">How to Convert a KML to a CSV File in Python?</a></p>
<p>The post <a href="https://blog.finxter.com/python-beautifulsoup-xml-to-dict-json-dataframe-csv/">Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Scrape a Bookstore in 5 Steps Python [Learn Project]</title>
		<link>https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 14 Jun 2022 21:23:53 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=422300</guid>

					<description><![CDATA[<p>Story: This series of articles assume you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure. 💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and ... <a title="Scrape a Bookstore in 5 Steps Python [Learn Project]" class="read-more" href="https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/" aria-label="Read more about Scrape a Bookstore in 5 Steps Python [Learn Project]">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/">Scrape a Bookstore in 5 Steps Python [Learn Project]</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><em><strong>Story</strong>: This series of articles assume you work in the IT Department of Mason Books. The Owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.</em></p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: Before continuing, we recommend you possess, at minimum, a basic knowledge of <a rel="noreferrer noopener" href="https://www.w3schools.com/html/" target="_blank">HTML</a> and <a rel="noreferrer noopener" href="https://www.w3schools.com/css/default.asp" target="_blank">CSS</a> and have reviewed our articles on <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-scrape-html-tables-part-1/" target="_blank">How to Scrape HTML tables</a>.</p>



<h2 class="wp-block-heading">What You&#8217;ll Build in This Project</h2>



<p>Let&#8217;s navigate to <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" data-type="URL" data-id="https://books.toscrape.com/index.html" target="_blank">Books to Scrape </a>and review the format. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="564" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png" alt="" class="wp-image-224055" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-300x165.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-768x423.png 768w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a.png 1247w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>At first glance, you will notice:</p>



<ul class="wp-block-list"><li>Book categories display on the left-hand side.</li><li>There are, in total, 1,000 books listed on the website.</li><li>Each web page shows 20 Books.</li><li>Each price is in £ (in this instance, the UK pound).</li><li>Each Book displays <strong>minimum </strong>details.</li><li>To view <strong>complete </strong>details for a book, click on the image or the <code>Book Title</code> hyperlink. This hyperlink forwards to a page containing additional book details for the selected item (see below).</li><li>The total number of website pages displays in the footer (<code>Page 1 of 50</code>).</li></ul>



<h2 class="wp-embed-aspect-16-9 wp-has-aspect-ratio wp-block-heading" id="getting-started">Step 1: Install and Import Libraries for Project</h2>



<p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">Before any data manipulation can occur, three (3) new libraries will require installation.</p>



<ul class="wp-block-list"><li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="URL" data-id="https://blog.finxter.com/pandas-quickstart/" target="_blank">Pandas</a></em> library enables access to/from a <em>DataFrame</em>.</li><li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-requests-tutorials/" data-type="URL" data-id="https://blog.finxter.com/best-python-requests-tutorials/" target="_blank">Requests</a> </em>library provides access to the HTTP requests in Python.</li><li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="URL" data-id="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank">Beautiful Soup </a>library enables data extraction from HTML and XML files.</li></ul>



<p>To install these libraries, navigate to an <a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-ide/" data-type="post" data-id="8106" target="_blank">IDE</a> terminal. At the command prompt, execute the code below. The prompt used in this example is a dollar sign (<code>$</code>); your terminal prompt may differ.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install pandas</pre>



<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install requests</pre>



<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install beautifulsoup4</pre>



<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>



<p>If the installations were successful, a message displays in the terminal indicating the same.</p>



<hr class="wp-block-separator has-css-opacity"/>



<p>Feel free to view the PyCharm installation guides for the required libraries.</p>



<ul class="wp-block-list"><li><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-pandas-in-python/" target="_blank"></a><a href="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install Pandas on PyCharm</a></li><li><a href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-requests-in-python/" target="_blank" rel="noreferrer noopener">How to install Requests on PyCharm</a></li><li><a href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install BeautifulSoup4 on PyCharm</a></li></ul>



<hr class="wp-block-separator has-css-opacity"/>



<p>Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.</p>



<pre class="EnlighterJSRAW wp-embed-aspect-16-9 wp-has-aspect-ratio" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer</pre>



<ul class="wp-block-list"><li>The <code>time</code> library is built into Python and does not require installation. It contains <a rel="noreferrer noopener" href="https://blog.finxter.com/time-delay-in-python/" data-type="URL" data-id="https://blog.finxter.com/time-delay-in-python/" target="_blank"><code>time.sleep()</code></a>, which is used to set a delay between page scrapes.</li><li>The <code>urllib</code> library is built into Python and does not require installation. It contains <code>urllib.request</code>, which is used to save images.</li><li>The <code>csv</code> library is built into Python and does not require installation. It contains the <code>reader</code> and <code>writer</code> classes used to save data to a CSV file.</li></ul>



<h2 class="wp-block-heading">Step 2: Understand Basics and Scrape Your First Results</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="909" height="462" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png" alt="" class="wp-image-224220" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png 909w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-300x152.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-768x390.png 768w" sizes="auto, (max-width: 909px) 100vw, 909px" /></figure>
</div>


<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list" id="block-990dfa6f-f2e6-423a-84d3-3fbfcb432a12"><li>Reviewing the website to scrape.</li><li>Understanding HTTP Status Codes.</li><li>Connecting to the <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" target="_blank">Books to Scrape</a> website using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-library/" target="_blank">requests</a> </code>library.</li><li>Retrieving&nbsp;Total Pages to Scrape</li><li>Closing the Open Connection.</li></ul>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-1/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-1/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Step 3: Configure URL to Scrape and Avoid Spamming the Server</h2>



<div class="wp-block-cover aligncenter is-light"><span aria-hidden="true" class="wp-block-cover__background has-background-dim"></span><img loading="lazy" decoding="async" width="886" height="672" class="wp-block-cover__image-background wp-image-422310" alt="" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png" data-object-fit="cover" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png 886w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-300x228.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-768x583.png 768w" sizes="auto, (max-width: 886px) 100vw, 886px" /><div class="wp-block-cover__inner-container is-layout-flow wp-block-cover-is-layout-flow">
<p class="has-text-align-center has-base-3-color has-text-color has-large-font-size"><strong>Rule: Don&#8217;t Spam the Server!</strong></p>
</div></div>



<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list" id="block-30f20a4a-690b-43a9-bf02-27dbdcbfb3a7"><li>Configuring a page URL for scraping</li><li>Setting a delay: <a href="https://blog.finxter.com/time-delay-in-python/"><code>time.sleep()</code> </a>to pause between page scrapes.</li><li><a href="https://blog.finxter.com/python-loops/" target="_blank" rel="noreferrer noopener">Looping</a> through two (2) pages for testing purposes.</li></ul>
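

<p>The tasks above can be sketched as follows. The <code>page-{}.html</code> pattern matches the Books to Scrape pagination; the two-page limit and one-second delay are illustrative values you would tune for a real run:</p>

```python
import time

# Paginated URL template for Books to Scrape.
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
pages_to_scrape = 2  # limit the loop while testing

urls = []
for page in range(1, pages_to_scrape + 1):
    urls.append(base_url.format(page))
    time.sleep(1)  # pause between iterations so the server is not spammed

print(urls)
```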



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-2/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-2/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Step 4: Save Book Details in a Python List</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png" alt="" class="wp-image-422311" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-768x532.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123.png 1268w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list"><li>Locating Book details.</li><li>Writing code to retrieve this information for all Books.</li><li>Saving <code>Book</code> details to a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener">List</a>.</li></ul>
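

<p>A minimal sketch of locating book details with BeautifulSoup. The inline HTML below mimics one product entry from the Books to Scrape listing page (simplified for illustration; the full tutorial works on the live page):</p>

```python
from bs4 import BeautifulSoup

# A simplified stand-in for one product entry on the listing page.
html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="a-light-in-the-attic.html">A Light in...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')

books = []
for article in soup.find_all('article', class_='product_pod'):
    title = article.find('a')['title']  # full title lives in the link's title attribute
    price = article.find('p', class_='price_color').get_text()
    books.append([title, price])

print(books)  # [['A Light in the Attic', '£51.77']]
```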



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-3/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-3/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Step 5: Clean and Save the Scraped Output</h2>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="340" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png" alt="" class="wp-image-422312" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-300x100.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-768x255.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124.png 1030w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>In this step, you&#8217;ll perform the following tasks:</p>



<ul class="wp-block-list"><li>Cleaning up the scraped code.</li><li>Saving the output to a <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank">CSV </a>file.</li></ul>
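

<p>The saving step can be sketched with the <code>csv</code> module from the imports above. The file name and the two cleaned rows are illustrative:</p>

```python
import os
import tempfile
from csv import reader, writer

# Hypothetical cleaned rows ready for export.
header = ['Title', 'Price']
rows = [['A Light in the Attic', '51.77'],
        ['Tipping the Velvet', '53.74']]

path = os.path.join(tempfile.gettempdir(), 'books.csv')
with open(path, 'w', newline='') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(header)   # column names first
    csv_writer.writerows(rows)    # then one row per book

# Read the file back to confirm the contents.
with open(path, newline='') as csv_file:
    print(list(reader(csv_file)))
```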



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-4/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-4/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This tutorial has guided you through the steps to create your first practical web scraping project: scraping the contents of a book store! </p>



<p>Now, go out and use your skills wisely and to the benefit of humanity, my friend! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p></p>
<p>The post <a href="https://blog.finxter.com/scrape-a-bookstore-in-5-steps-a-python-learning-project/">Scrape a Bookstore in 5 Steps Python [Learn Project]</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
