<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Charles Blue, Author at Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/author/charlesblue/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/author/charlesblue/</link>
	<description></description>
	<lastBuildDate>Thu, 19 Oct 2023 08:56:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Charles Blue, Author at Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/author/charlesblue/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</title>
		<link>https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Thu, 19 Oct 2023 08:56:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652308</guid>

					<description><![CDATA[<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from MindBodyOnline.com or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot! 🕷️ Web scraping, a technique used to extract data from ... <a title="How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com" class="read-more" href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/" aria-label="Read more about How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from <a href="http://MindBodyOnline.com">MindBodyOnline.com</a> or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="598" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png" alt="" class="wp-image-1652328" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-300x175.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-768x449.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1536x897.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156.png 1585w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f577.png" alt="🕷" class="wp-smiley" style="height: 1em; max-height: 1em;" /></strong> <strong>Web scraping</strong>, a technique used to extract data from websites, has become an essential skill on Upwork &#8212; it&#8217;s one of the most sought-after skills on most <a href="https://blog.finxter.com/best-python-freelancer-platforms/">freelancing platforms</a>. Most beginners start with the <strong><a href="https://blog.finxter.com/installing-beautiful-soup/">Beautiful Soup</a></strong> and <strong><a href="https://blog.finxter.com/python-requests-library-2/">Requests</a></strong> modules in Python. These tools cover many sites, but not all of them. Tools like <strong><a href="https://blog.finxter.com/how-to-open-a-url-in-python-selenium/">Selenium</a></strong> can fill the gap, but they are sometimes overkill or inefficient. </p>



<p>So, where should one start? The answer is simple: Always check for an API first.</p>



<h3 class="wp-block-heading">Why Start with APIs?</h3>



<p>An <strong>Application Programming Interface (API)</strong> allows two software applications to communicate with each other. Many websites offer APIs to provide structured access to their data, making it easier and more efficient than scraping the web pages directly.</p>



<p>Benefits of using APIs:</p>



<ul class="wp-block-list">
<li><strong>Efficiency</strong>: Extracting data from APIs is often faster and less resource-intensive than scraping web pages.</li>



<li><strong>Reliability</strong>: APIs are designed to be accessed programmatically, reducing the chances of breaking changes.</li>



<li><strong>Ethical considerations</strong>: Accessing data via an API is often more in line with a website&#8217;s terms of service than scraping their pages directly.</li>
</ul>



<p>MindBodyOnline provides a dedicated API tailored for developers: <a href="https://developers.mindbodyonline.com/ui/documentation/public-api#/http/mindbody-public-api-v6-0/introduction/getting-started">MindBody API</a>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="536" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png" alt="" class="wp-image-1652310" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-300x157.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-768x402.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140.png 1426w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>If you&#8217;re aiming to craft an app utilizing their dataset, this API is your ideal resource. It boasts a plethora of endpoints, enabling swift data retrieval and ensuring seamless interaction between your application and their servers.</p>



<p><strong>But what if you aren’t creating an application and just need to scrape data once for research?</strong> MindBodyOnline also retrieves data for its own website via an API. JavaScript is used to request the data that populates the site, and we can make requests to that same API ourselves.</p>



<h2 class="wp-block-heading">How to Check if a Website Is Rendered with JavaScript</h2>



<p>The site we will be scraping is <a href="https://www.mindbodyonline.com/explore">MindBodyOnline</a>. </p>



<p>If a website is rendered with <a href="https://blog.finxter.com/javascript-data-types/">JavaScript</a>, we should check the network traffic and see if we can find a request that returns the data we see on the page. This can be done quickly with developer tools. In Chrome, you can bring up developer tools by pressing <code>Ctrl-Shift-I</code>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png" alt="" class="wp-image-1652313" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-768x531.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1536x1063.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143.png 1620w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>From here, we can turn off JavaScript, then refresh the page and see if anything changes. To turn off JavaScript, first press <code>Ctrl-Shift-P</code> to bring up the command palette. Start typing “javascript” to filter the options, then click “Disable JavaScript”.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="506" height="117" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png" alt="" class="wp-image-1652311" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png 506w, https://blog.finxter.com/wp-content/uploads/2023/10/image-141-300x69.png 300w" sizes="auto, (max-width: 506px) 100vw, 506px" /></figure>
</div>


<p>Then refresh the page. As we can see, they use JavaScript for all the data.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="444" height="95" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png" alt="" class="wp-image-1652312" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png 444w, https://blog.finxter.com/wp-content/uploads/2023/10/image-142-300x64.png 300w" sizes="auto, (max-width: 444px) 100vw, 444px" /></figure>
</div>


<p>Before we can continue, we need to turn JavaScript back on. Bring up the command palette again, filter for “javascript”, and click “Enable JavaScript”. Then refresh the page again.</p>
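
<p>This manual check can also be approximated in code. The sketch below tests whether a string you can see in the browser is present in the raw HTML; if it isn&#8217;t, the page is most likely rendered client-side. The network call is commented out so the sketch stands alone, and the sample strings and gym name are hypothetical.</p>

```python
def appears_js_rendered(html, expected_text):
    # If text that is visible in the browser is missing from the raw HTML,
    # the data is almost certainly filled in by JavaScript.
    return expected_text.lower() not in html.lower()

# A real check would fetch the page first, e.g.:
# import requests
# html = requests.get("https://www.mindbodyonline.com/explore").text
# print(appears_js_rendered(html, "a gym name you can see in the browser"))

# Hypothetical samples for illustration:
server_page = "full page body that already contains the Example Gym listing"
client_page = "empty app shell; listings are filled in by JavaScript later"
```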



<h2 class="wp-block-heading">Check the JavaScript Requests</h2>



<p>Select the Network tab in developer tools.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png" alt="" class="wp-image-1652314" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-144-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>Make sure <code>Fetch/XHR</code> and <code>Preserve log</code> are selected. Next, we can click the circle with the line through it to clear the output. Then perform a search to see what requests were performed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="192" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png" alt="" class="wp-image-1652315" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-145-300x92.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can then check each item in the output to see if it returns useful information.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="601" height="255" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png" alt="" class="wp-image-1652316" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png 601w, https://blog.finxter.com/wp-content/uploads/2023/10/image-146-300x127.png 300w" sizes="auto, (max-width: 601px) 100vw, 601px" /></figure>
</div>


<p>We are primarily interested in the response to each request. We are looking for JSON data that matches the data shown on the page. In this case, it is the <code>locations</code> request that contains the data we seek.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="395" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png" alt="" class="wp-image-1652317" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-147-300x190.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can also see that there is a payload required. When we make our requests, we must provide this payload in the request body. There are three items of interest here. The latitude and longitude allow us to control the city we are pulling data for, and we also need to provide a page number.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="562" height="111" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png" alt="" class="wp-image-1652318" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png 562w, https://blog.finxter.com/wp-content/uploads/2023/10/image-148-300x59.png 300w" sizes="auto, (max-width: 562px) 100vw, 562px" /></figure>
</div>


<p>MindBody uses pagination, so a relatively small amount of data is pulled with each request. A large city like New York can have over a hundred pages.</p>
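
<p>Put together, the payload can be sketched as a Python dictionary. The field names below are the ones visible in the captured request; the coordinates are hypothetical values for New York City:</p>

```python
import json

# Hypothetical coordinates for New York City; page size 50 matches
# the site's own requests.
payload = {
    "sort": "-_score,distance",
    "page": {"size": 50, "number": 1},
    "filter": {
        "categories": "any",
        "latitude": 40.7128,
        "longitude": -74.0060,
        "categoryTypes": "any",
    },
}
body = json.dumps(payload)  # serialized into the request body
```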



<p>We go to the headers tab to copy the request URL.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="120" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png" alt="" class="wp-image-1652319" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-149-300x58.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<h2 class="wp-block-heading">Using Insomnia to Generate Request Headers</h2>



<p>From here, we can use a tool to help us with the request syntax. </p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Insomnia</strong> is a powerful open-source API client tool for testing and debugging APIs. It provides a user-friendly interface to send requests to web services and view responses. With Insomnia, you can define various request types, from simple HTTP GET requests to complex JSON, GraphQL, or even multipart file uploads. You can download the Insomnia desktop app <a href="https://insomnia.rest/download">here</a>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="601" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png" alt="" class="wp-image-1652320" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-300x176.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-768x451.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150.png 1342w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Using Insomnia is quite simple. Just paste in the API URL and click <code>Send</code>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png" alt="" class="wp-image-1652321" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-151-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can check the preview tab to make sure it returns the data we want:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="493" height="506" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png" alt="" class="wp-image-1652322" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png 493w, https://blog.finxter.com/wp-content/uploads/2023/10/image-152-292x300.png 292w" sizes="auto, (max-width: 493px) 100vw, 493px" /></figure>
</div>


<p>This is where it gets good. If we click the dropdown on the Send button, one of the options is “Generate client code”. How convenient! Just select Python as the language and Requests as the library, click “Copy to Clipboard”, and you’re off to the races.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="397" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png" alt="" class="wp-image-1652323" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-153-300x191.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>
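
<p>The generated snippet looks roughly like the following. Header values are abridged here (the full captured set appears in the spider code below), and the network call is commented out so the sketch stands alone:</p>

```python
# The endpoint URL is the one copied from the Headers tab; header values
# are abridged (the full set appears in the spider code below).
url = "https://prod-mkt-gateway.mindbody.io/v1/search/locations"
headers = {
    "accept": "application/vnd.api+json",
    "content-type": "application/json",
    "origin": "https://www.mindbodyonline.com",
    # ...user-agent, cookie, and x-mb-* headers as captured...
}

# The actual call (commented out so this sketch makes no network request):
# import requests
# response = requests.request("GET", url, data=payload_string, headers=headers)
# print(response.text)
```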


<h2 class="wp-block-heading">A Simple Scrapy Spider</h2>



<p>The code can be found on <a href="https://github.com/PythonCB/Scrape_MindBodyOnline">Github</a>. I will walk through the code below, starting with the imports.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import scrapy
import json
import pandas as pd
from scrapy.crawler import CrawlerProcess
import os
</pre>



<p><a href="https://blog.finxter.com/python-scrapy-scraping-dynamic-website-with-api-generated-content/">Scrapy</a> is a good option because it can handle multiple requests at the same time with <a href="https://blog.finxter.com/python-async-for-mastering-asynchronous-iteration-in-python/">asynchronous</a> processing. Scrapy has a lot of bells and whistles and a fair bit of a learning curve, but it’s also possible to avoid much of the extra complexity. The goal here was to place all the code in one simple script.</p>



<p>First, we have to create a spider class. The class is pretty large so I’ll display it in chunks.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">class MindbodySpider(scrapy.Spider):
    name = 'mindbody_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 5,
        'DOWNLOAD_DELAY': 3.2,
    }
</pre>



<p>Our class inherits from one of the Scrapy <code>Spider</code> classes, with <code>scrapy.Spider</code> being the simplest. In the custom settings, with <code>CONCURRENT_REQUESTS</code> set to <code>5</code>, Scrapy will process up to five requests at a time, starting a new one as soon as one finishes. </p>



<p>We use a <code>DOWNLOAD_DELAY</code> so we don’t bombard the website with too many requests at once.</p>



<p>Next, we need a starting template for the payload:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">starting_payload = '''{
                          "sort":"-_score,distance",
                          "page":{"size":50,"number":&lt;&lt;pg>>},
                          "filter":{"categories":"any",
                                    "latitude":&lt;&lt;lat>>,
                                    "longitude":&lt;&lt;lon>>,
                                    "categoryTypes":"any"}
                       }'''
</pre>



<p>Next, we have the headers that Insomnia so helpfully provided for us.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">headers = {
        "cookie": "__cf_bm=zdIhLHXKd2OAveBChKORUMdydUFVzC2Ma51sQxv.UJ0-1694646164-0-Abmbwcj2wNw%2FpityY4DWRWy%2FftBkjTO0vQ3tZ0gwU0P5bsTqcasf2XZlBwL%2BUaevGaH%2BTDzZOJPBXbWYwgsXkJc%3D",
        "authority": "prod-mkt-gateway.mindbody.io",
        "accept": "application/vnd.api+json",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "origin": "https://www.mindbodyonline.com",
        "sec-ch-ua": "^\^Not/A",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "^\^Windows^^",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "cross-site",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "x-mb-app-build": "2023-08-02T13:33:44.200Z",
        "x-mb-app-name": "mindbody.io",
        "x-mb-app-version": "e5d1fad6",
        "x-mb-user-session-id": "oeu1688920580338r0.2065068094427127"
    }
</pre>



<p>Then a very simple <code>__init__</code> method:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def __init__(self):
        scrapy.Spider.__init__(self)
        self.city_count = 0
</pre>



<p>The <code>start_requests</code> method loops through each city. This is the main loop that creates the first request for each city.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def start_requests(self):
        cities = pd.read_csv('uscities.csv')

        for idx, city in cities.iterrows():
            lat, lon = city.lat, city.lng
            self.logger.info(f"{city.city}, {city.state_id} started")

            # Start with the first page for each city
            payload = self.starting_payload.replace('&lt;&lt;pg>>', '1').replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))

            yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': city.city, 'page_num': 1, 'lat': lat, 'lon': lon, 'state': city.state_id},
                callback=self.parse
            )
</pre>



<p>The code is pretty simple. We <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/">create a DataFrame</a> from a <a href="https://blog.finxter.com/read-a-csv-file-to-a-pandas-dataframe/">CSV file</a> with city information and then loop through it with the <code>iterrows</code> method. We create the payload for the request using the template and the lat/long values from the DataFrame. The page is set to 1 each time. We will handle additional pages later.</p>



<p>Finally, we yield a <code>scrapy.Request</code> object. We use <code><a href="https://blog.finxter.com/yield-keyword-in-python-a-simple-illustrated-guide/">yield</a></code> instead of <code><a href="https://blog.finxter.com/python-return/">return</a></code> so we can handle <a href="https://blog.finxter.com/python-async-requests-getting-urls-concurrently-via-https/">multiple requests concurrently</a>. The body is our modified payload, and we use the same header for each request.</p>
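
<p>The difference matters because <code>start_requests</code> is a generator: Scrapy pulls requests from it one at a time and schedules them, instead of waiting for a complete list. A tiny standalone illustration:</p>

```python
def make_items():
    # Each yield hands one item back to the consumer immediately;
    # a return would end the function after a single value.
    for page in range(1, 4):
        yield {"page": page}

items = list(make_items())  # Scrapy drains start_requests() the same way
```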



<p>What do we do with the response returned from the request? As soon as the response is returned it is fed into the parse method thanks to the callback parameter:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">callback=self.parse</pre>



<p>The <code>meta</code> parameter gives us a way to pass information to the <code>callback</code> function. We need the <code>page_num</code>, <code>lat</code>, and <code>lon</code> values for the next request. <code>city_name</code> and <code>state</code> are used for screen output.</p>



<p>The list of cities was downloaded from the web. Many different city lists will work, as long as they contain latitude and longitude values.</p>
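
<p>As a sketch, a minimal stand-in for <code>uscities.csv</code> only needs the columns the spider reads (<code>city</code>, <code>state_id</code>, <code>lat</code>, <code>lng</code>); the two rows below are illustrative:</p>

```python
import io
import pandas as pd

# Illustrative rows; any city list works if it has these columns.
csv_text = """city,state_id,lat,lng
New York,NY,40.7128,-74.0060
Los Angeles,CA,34.0522,-118.2437
"""
cities = pd.read_csv(io.StringIO(csv_text))
coords = [(row.lat, row.lng) for _, row in cities.iterrows()]
```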



<h2 class="wp-block-heading">Parsing the Response</h2>



<p>The <code>parse</code> method is a little long, but not too complicated. </p>



<p>Getting the data and saving it is very easy. We just convert <code>response.text</code> to a DataFrame and <a href="https://blog.finxter.com/how-to-export-pandas-dataframe-to-csv-example/">save it to a CSV file</a>. If the file already exists, we will append the data and not include a header. Otherwise, we create a new CSV file and include a header.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def parse(self, response):
        data = json.loads(response.text)
        gyms_df = pd.json_normalize(data['data'])

        # Save the dataframe to a CSV
        city_name = response.meta['city_name']
        state = response.meta['state']
        fname = f'{city_name} {state}.csv'.replace(' ', '_')
        csv_path = f'./data/cities2/{fname}'

        # Check if file exists to determine the write mode
        write_mode = 'a' if os.path.exists(csv_path) else 'w'

        gyms_df.to_csv(csv_path, 
                       mode=write_mode, 
                       index=False, 
                       header=(not os.path.exists(csv_path)))         
</pre>



<h2 class="wp-block-heading">Handling Pagination</h2>



<p>To move on to the next page, we need to create another Scrapy Request. For the payload we use the same latitude and longitude and increment the page number by 1.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">        # Check if there's another page and if so, initiate the request
        next_page_num = response.meta['page_num'] + 1
        if next_page_num &lt;= 150:  # Optional: upper limit
            lat, lon = response.meta['lat'], response.meta['lon']  # passed along in meta

            payload = self.starting_payload.replace('&lt;&lt;pg>>', str(next_page_num)).replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))
</pre>



<h2 class="wp-block-heading">Make the Request for the Next Page</h2>



<p>To finish the <code>parse</code> method, all we have to do is make another request with the new payload.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': response.meta['city_name'], 
                      'page_num': next_page_num, 
                      'lat': lat, 
                      'lon': lon,
                      'state': state},
                callback=self.parse
            )

        self.city_count += 1
        print(response.meta['city_name'], f'complete ({self.city_count})')
        self.logger.info(f"""{response.meta['city_name']}, 
                           {response.meta['state']} is complete""")
</pre>



<h2 class="wp-block-heading">How the Pagination Loop Terminates</h2>



<p>What happens if there are 100 pages for the current city and the code sends a request with <code>page_num = 101</code>? </p>



<p>The request will not return anything, so the callback function won’t get called and the recursive loop for that city will stop. </p>



<p>Then the <code>start_requests</code> loop will move on to the next city.</p>
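
<p>If you&#8217;d rather not rely on that behavior (or on the 150-page cap alone), an explicit check is easy to add. This sketch inspects the response body and only continues while the <code>data</code> list is non-empty:</p>

```python
import json

def has_more_results(response_text):
    # Keep paginating only while the API returns gym records.
    data = json.loads(response_text)
    return bool(data.get("data"))

# In parse(), yield the next request only if has_more_results(response.text)
```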



<h2 class="wp-block-heading">It’s alive! Setting Our Little Spider Loose</h2>



<p>To get our creepy critter crawling, we create a <code>CrawlerProcess</code>. Then tell it to crawl. Then tell it to start. On your mark, get set, CRAWL!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">process = CrawlerProcess()
process.crawl(MindbodySpider)
process.start()
</pre>



<h2 class="wp-block-heading">Results</h2>



<p>I was able to scrape data for 16,000 cities in about half a week. I think I averaged about 100 cities an hour. The larger cities had over a hundred pages, but there were <strong>thousands upon thousands of cities with 5-10 pages</strong>.</p>



<p>What about the data? It’s fairly extensive and could be very useful.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="518" height="788" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png" alt="" class="wp-image-1652324" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png 518w, https://blog.finxter.com/wp-content/uploads/2023/10/image-154-197x300.png 197w" sizes="auto, (max-width: 518px) 100vw, 518px" /></figure>
</div>


<p>Pretty good information related to services offered, location, amenities, total ratings, etc. Looking at the rest of the columns:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="510" height="396" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png" alt="" class="wp-image-1652325" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png 510w, https://blog.finxter.com/wp-content/uploads/2023/10/image-155-300x233.png 300w" sizes="auto, (max-width: 510px) 100vw, 510px" /></figure>
</div>
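
<p>A natural next step is merging the per-city files into a single DataFrame. A sketch with two inline stand-ins for city files (real code would load them with <code>glob</code> and <code>pd.read_csv</code>; the <code>id</code> column used for deduplication is an assumption about the data):</p>

```python
import pandas as pd

# Two inline stand-ins for per-city CSV files; gyms near a border can
# appear in more than one city's file, so we deduplicate on "id".
ny = pd.DataFrame({"id": [1, 2], "name": ["Gym A", "Gym B"]})
nj = pd.DataFrame({"id": [2, 3], "name": ["Gym B", "Gym C"]})

all_gyms = pd.concat([ny, nj], ignore_index=True).drop_duplicates(subset="id")
```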


<h2 class="wp-block-heading">Conclusion</h2>



<p>Uncovering the API proved invaluable. It eliminated the need to craft path selectors for individual data elements, significantly streamlining the process. Moreover, it spared me from devising a Scrapy workaround for the JavaScript-rendered page. Investing time in learning Scrapy was a sound decision, given its superior speed compared to other methods I explored.</p>



<p>Looking ahead, the logical progression is to integrate the data into platforms like Jupyter Notebook, Power BI, or Tableau. Furthermore, storing the data in a database seems apt, especially considering the apparent one-to-many relationships observed in each city, like categories and subcategories.</p>
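
<p>The database idea can be sketched in a few lines with SQLite (the table and column names here are illustrative; the one-to-many pieces such as categories would go into their own tables keyed by gym id):</p>

```python
import sqlite3
import pandas as pd

# Illustrative table; real data would come from the combined gym CSVs.
gyms = pd.DataFrame({"id": [1, 2], "name": ["Gym A", "Gym B"]})

with sqlite3.connect(":memory:") as conn:
    gyms.to_sql("gyms", conn, index=False, if_exists="replace")
    count = conn.execute("SELECT COUNT(*) FROM gyms").fetchone()[0]
```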



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to become a master web scraper, feel free to check out our academy course with downloadable PDF certificate to showcase your skills to future employers or freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="800" height="341" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png" alt="" class="wp-image-1652329" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png 800w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-300x128.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-768x327.png 768w" sizes="auto, (max-width: 800px) 100vw, 800px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Academy</strong>: <a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/">Web Scraping with BeautifulSoup</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How I Cracked the Top 100 in the Kaggle House Prices Competition</title>
		<link>https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Wed, 07 Jun 2023 09:04:29 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1420178</guid>

					<description><![CDATA[<p>Kaggle is a vibrant online community for data science and machine learning, providing a platform for learning, sharing, and competition. It&#8217;s an invaluable resource for individuals interested in these fields, regardless of their level of experience. The Kaggle House Prices &#8211; Advanced Regression Techniques Competition, in particular, is an excellent starting point for anyone who ... <a title="How I Cracked the Top 100 in the Kaggle House Prices Competition" class="read-more" href="https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/" aria-label="Read more about How I Cracked the Top 100 in the Kaggle House Prices Competition">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/">How I Cracked the Top 100 in the Kaggle House Prices Competition</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Kaggle is a vibrant online community for data science and machine learning, providing a platform for learning, sharing, and competition. It&#8217;s an invaluable resource for individuals interested in these fields, regardless of their level of experience. </p>



<p>The <strong>Kaggle House Prices &#8211; Advanced Regression Techniques Competition</strong>, in particular, is an excellent starting point for anyone who has completed a data science or machine learning course and is eager to gain practical experience. </p>



<p>Participants are tasked with predicting the final price of residential homes in Ames, Iowa, based on 79 explanatory variables describing various aspects of the properties. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="947" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-57.png" alt="" class="wp-image-1420271" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-57.png 947w, https://blog.finxter.com/wp-content/uploads/2023/06/image-57-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-57-768x512.png 768w" sizes="auto, (max-width: 947px) 100vw, 947px" /></figure>
</div>


<p>The variables include a vast array of house attributes such as the type of dwelling, the size of the living area, the number of rooms, the year the house was built, the quality and condition of various features, the neighborhood, and many more. The challenge aims to encourage the application of advanced regression techniques and creative feature engineering to build models that can accurately predict house prices, an important task in real estate analytics.</p>



<p>A couple of years ago, right after finishing an online data science bootcamp, I decided to try my hand at the <strong><em>House Prices</em></strong> competition. I found it equally fun and frustrating. I became obsessed with cracking the top 100 on the leaderboard. After much struggle, I finally made it. The code can be found <a href="https://github.com/finxter/Kaggle-House-Prices-Competition" data-type="URL" data-id="https://github.com/finxter/Kaggle-House-Prices-Competition" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>I thought it would be fun to revisit this challenge and write an article about it. </p>



<p>After dusting off the code, I found it held up pretty well. It put me in the 130s on the public leaderboard. I figured I’d tweak the code a bit, get back in the top 100 and write my article. Unfortunately, I got stuck just below 110 and found myself trapped in the same cycle:</p>



<ol class="wp-block-list">
<li>Try anything and everything I can think of</li>



<li>Review other notebooks and try everything everyone else thought of</li>



<li>Find my current notebook a bloated mess that&#8217;s hard to work with, so I start another one</li>
</ol>



<p>Finally, I found another notebook someone graciously posted <a rel="noreferrer noopener" href="https://www.kaggle.com/code/thegamer7675/midterm-210045452" target="_blank">here</a> that achieved the score I was looking for. It took a while to unpack the code. In doing so, I found some things that worked, but I didn&#8217;t really know how they were derived or why they worked. The biggest difference I found was that this notebook focused much more heavily on feature engineering than mine did.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-58-1024x576.png" alt="" class="wp-image-1420275" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-58-1024x576.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-58-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-58-768x432.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-58.png 1122w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>After much effort, I finally stumbled upon “3 simple tricks” that I found particularly helpful. I was able to grind my way to the score I wanted while feeling like I actually understood what was going on. Here they are:</p>



<ol class="wp-block-list">
<li>Use a <code>sklearn</code> pipeline and a &#8220;train_and_test&#8221; function to organize the code.</li>



<li>Use visualizations and Pandas <code><a href="https://blog.finxter.com/pd-dataframe-groupby-a-simple-illustrated-guide/" data-type="post" data-id="340015" target="_blank" rel="noreferrer noopener">groupby</a></code> queries to brainstorm feature engineering ideas</li>



<li>Use the <code>tpot</code> library to help brainstorm ideas for jazzing up the pipeline and using more advanced models.</li>
</ol>



<p>The full model can be found in this <a href="https://www.kaggle.com/code/onehundreddays/house-prices-top-100-score" target="_blank" rel="noreferrer noopener">Kaggle notebook</a>. But I hope you will take a stab at the competition first, then compare your code to mine. I got a Kaggle score of 0.11229, which at the time of writing is good enough for rank 84 out of 4,742 entries. </p>



<h2 class="wp-block-heading">Getting Started</h2>



<p>The easiest way to get started is to join the competition and create a notebook within Kaggle. From the competition page, click the Code tab, then the New Notebook button.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="233" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-52-1024x233.png" alt="" class="wp-image-1420185" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-52-1024x233.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-52-300x68.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-52-768x175.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-52.png 1128w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>The first cell in the new notebook will already be populated. If you run the cell it will show you where to get the data. You can then <a href="https://blog.finxter.com/how-to-convert-tab-delimited-file-to-csv-in-python/" data-type="post" data-id="563635" target="_blank" rel="noreferrer noopener">load the data</a> into Pandas DataFrames using the given locations.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sample_submission = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv")
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")</pre>



<p>Next, all we have to do is split the data into the standard X and y for the features and target. Then we will be ready to create the pipeline.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X = train.drop('SalePrice', axis=1)
y = train[['SalePrice']].copy()
y = np.log1p(y)
</pre>



<p><code>SalePrice</code> is skewed. A handful of very expensive houses extend the right tail. The log of <code>SalePrice</code> is much closer to a normal distribution. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="295" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-50-1024x295.png" alt="" class="wp-image-1420184" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-50-1024x295.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-50-300x86.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-50-768x221.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-50.png 1248w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>
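<p>The effect of the transform can be checked numerically. Here is a small self-contained sketch on synthetic, right-skewed prices (not the competition data) that computes sample skewness before and after applying <code>np.log1p</code>:</p>

```python
import numpy as np

def sample_skew(x):
    # Third standardized moment: roughly 0 for a symmetric distribution
    m = x.mean()
    return ((x - m) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
prices = np.exp(rng.normal(12, 0.4, 5000))  # synthetic right-skewed prices

print(sample_skew(prices) > 0.8)                  # True: strongly right-skewed
print(abs(sample_skew(np.log1p(prices))) < 0.15)  # True: roughly symmetric
```

A skewness near zero after the transform is what makes the log-price a friendlier target for linear models.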


<p>There is an interesting side effect of building a model on the log-transformed target variable. When you use the log of <code>SalePrice</code> as the response variable, the interpretation of the coefficients changes. </p>



<p>Now, a one-unit increase in a predictor variable corresponds to a percentage change in <code>SalePrice</code>, rather than an absolute change. </p>



<p>So, in the log-transformed model, if the coefficient of a predictor variable is 0.01, then a one-unit increase in that predictor is associated with an approximately 1% increase in <code>SalePrice</code>. </p>



<p>This means a coefficient can work just as well for a $60,000 house as for a $600,000 house.</p>
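<p>A quick numeric check of this percentage interpretation (hypothetical prices, not competition data):</p>

```python
import numpy as np

# Adding 0.01 on the log scale multiplies the price by exp(0.01) ~ 1.01,
# i.e. roughly a 1% increase, regardless of the starting price.
for price in (60_000, 600_000):
    new_price = np.expm1(np.log1p(price) + 0.01)
    pct_change = (new_price - price) / price * 100
    print(round(pct_change, 3))  # ~1.005 for both prices
```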



<h2 class="wp-block-heading"><strong>Machine Learning Pipelines: A Key Tool in Model Building</strong></h2>



<p>In the realm of machine learning, Sklearn&#8217;s pipeline is an indispensable tool that simplifies the process of building and evaluating models. It neatly chains together data transformation steps and the machine learning model in a sequence. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-59.png" alt="" class="wp-image-1420278" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-59.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-59-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p>When you fit the pipeline, it seamlessly performs the data transformations before fitting the model with the transformed data.</p>



<p>To demonstrate the usage of pipelines, let&#8217;s consider a task. We have our features stored in a DataFrame X and target values in a variable y. Our goal is to create a pipeline that:</p>



<ul class="wp-block-list">
<li>Imputes null values for numerical data with the median</li>



<li>Imputes null values for text data with the most common value</li>



<li>Scales the numeric data with <code>StandardScaler</code></li>



<li>Uses One Hot Encoding on the text data</li>
</ul>



<p>Here&#8217;s how you can implement it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


# Identify numeric columns
numeric_columns = X.select_dtypes(include=['number']).columns


# Identify categorical columns
categorical_columns = X.select_dtypes(include=['object']).columns


# Create transformers
numeric_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)


categorical_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore')
)


# Combine transformers into a preprocessor step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)
    ]
)


# The preprocessor can now be used in a pipeline with a final estimator
# model = make_pipeline(preprocessor, YourModel())
</pre>



<p>This code has three essential parts:</p>



<ol class="wp-block-list">
<li><strong>Identifying the types of columns</strong>: Numeric columns are handled differently from non-numeric ones. We fill null values for numeric columns with the median and for text data with the most common value.</li>



<li><strong>Creating transformers</strong>: We use the <code>make_pipeline</code> function to create a data transformer for each type of column. The numeric transformer imputes values then scales them, and the categorical transformer fills missing data with the most frequent value, then applies One Hot Encoding to the result.</li>



<li><strong>Combining transformers</strong>: We apply different transformers to different columns using the <code>ColumnTransformer</code>.</li>
</ol>



<p>Next, let&#8217;s package this process into a function, <code>train_and_test</code>, which accepts a machine learning model and a data manipulation function as parameters. This allows us to easily test different models and feature engineering approaches.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def train_and_test(model, X, y, data_func=None):
    X_copy = X.copy()
    y_copy = y.copy()

    # Optionally apply a feature engineering function first
    if data_func:
        data_func(X_copy, y_copy)

    # impute_and_encode() builds the ColumnTransformer preprocessor shown above
    pipe = make_pipeline(
        impute_and_encode(X_copy),
        model
    )

    pipe.fit(X_copy, y_copy)
    evaluate_model(pipe, X, y)
</pre>



<h2 class="wp-block-heading">Evaluating the Model with RMSE</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>RMSE</strong> stands for Root Mean Squared Error. Here&#8217;s how it works: for each data point, the model&#8217;s predicted value is subtracted from the actual value to give the prediction error. Each of these errors is then squared and the results are averaged across all data points. Finally, the square root of this average is taken to give the RMSE.</p>
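<p>The calculation can be sketched in a few lines of NumPy (toy numbers, not model output):</p>

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted            # prediction errors
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then square root
print(round(rmse, 4))  # 0.9354
```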



<p>Because the errors are squared before being averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable. Here is the code to evaluate model performance:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def evaluate_model(model, X, y):
    model.fit(X, y)


    rmse_scores = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    rmse_mean = rmse_scores.mean()


    # Calculate R-squared score using cross validation
    r2_scores = cross_val_score(model, X, y, scoring="r2", cv=5)
    r2_mean = r2_scores.mean()
    print(f'mean RMSE with 5 folds: {rmse_mean}')
    print(f'mean R2: {r2_mean}')
    return rmse_mean, r2_mean</pre>



<p>The basic idea behind cross-validation is to divide the data into a number of subsets, or &#8216;folds&#8217;. </p>



<p>The model is then trained on all but one of these folds and tested on the remaining fold. This process is repeated with each fold serving as the test set once. </p>



<p>This is often referred to as <strong><em>K-fold cross-validation</em></strong>, where K is the number of folds. Cross-validation gives a better measure of how well your model will perform on unseen data than using a single train-test split.</p>
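<p>A small sketch of how the folds are formed, using <code>KFold</code> on ten dummy samples:</p>

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)
kf = KFold(n_splits=5)

# Each sample lands in exactly one test fold across the 5 splits
test_folds = [test_idx for _, test_idx in kf.split(data)]
print(len(test_folds))                     # 5 folds
print(sorted(np.concatenate(test_folds)))  # every index 0..9 appears once
```

<code>cross_val_score</code>, used in <code>evaluate_model</code> above, performs exactly this splitting internally and returns one score per fold.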



<h2 class="wp-block-heading">Unleashing Exploratory Data Analysis for Feature Engineering</h2>



<p><strong>Feature engineering</strong> is a crucial phase in the model-building process where you transform existing features and create new ones with the aim of enhancing model performance. A great starting point for feature engineering is to get acquainted with the existing features through <strong><a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" data-type="post" data-id="335731" target="_blank" rel="noreferrer noopener">Exploratory Data Analysis (EDA)</a></strong>. Let&#8217;s see how this process can lead us to discover some intriguing insights.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-60-1024x575.png" alt="" class="wp-image-1420280" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-60-1024x575.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-60-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-60-768x431.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-60.png 1124w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" data-type="URL" data-id="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" target="_blank" rel="noreferrer noopener">Easy Exploratory Data Analysis (EDA) in Python with Visualization</a></p>



<p>A widely used EDA visualization tool is the <a href="https://blog.finxter.com/how-to-make-heatmap-using-pandas-dataframe/" data-type="post" data-id="61559" target="_blank" rel="noreferrer noopener">heatmap</a>, which provides an overview of feature correlations. Let&#8217;s take a closer look at how our features correlate with &#8216;<code>SalePrice</code>&#8216; – the target feature.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">plt.figure(figsize=(4,10))
# numeric_only=True restricts the correlation matrix to numeric columns
sns.heatmap(train.corr(numeric_only=True)[['SalePrice']], annot=True)
plt.title('Correlations with SalePrice')
plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="480" height="754" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-53.png" alt="" class="wp-image-1420188" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-53.png 480w, https://blog.finxter.com/wp-content/uploads/2023/06/image-53-191x300.png 191w" sizes="auto, (max-width: 480px) 100vw, 480px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/heatmaps-with-seaborn/" data-type="post" data-id="19568" target="_blank" rel="noreferrer noopener">Creating Beautiful Heatmaps with Seaborn</a></p>



<p>A notable anomaly in this heatmap is the feature &#8216;<code>OverallCond</code>&#8216;, which denotes the overall condition of the house on a scale of 1 to 10 (10 being the best). </p>



<p>Intuitively, we&#8217;d expect houses in better condition to fetch higher prices, translating to a strong positive correlation. But surprisingly, &#8216;<code>OverallCond</code>&#8216; demonstrates a meager correlation of -0.037 with &#8216;<code>SalePrice</code>&#8216;.</p>



<p>This presents an exciting puzzle – can we improve the model&#8217;s performance by modifying &#8216;<code>OverallCond</code>&#8216;, crafting a new feature, or simply discarding it? With our pipeline and <code>train_and_test</code> function set up, testing these alternatives is a breeze.</p>



<p>Before we proceed, let&#8217;s visualize &#8216;<code>OverallCond</code>&#8216; vs &#8216;<code>SalePrice</code>&#8216; on a scatter plot:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="529" height="384" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-49.png" alt="" class="wp-image-1420183" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-49.png 529w, https://blog.finxter.com/wp-content/uploads/2023/06/image-49-300x218.png 300w" sizes="auto, (max-width: 529px) 100vw, 529px" /></figure>
</div>


<p>The plot seems to suggest a positive correlation, contradicting the correlation matrix. A peek at the histogram of &#8216;<code>OverallCond</code>&#8216; reveals that the majority of houses have a value of 5.</p>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Let&#8217;s posit a hypothesis &#8211; <strong>Could the age of the house influence how &#8216;<code>OverallCond</code>&#8216; affects &#8216;<code>SalePrice</code>&#8216;?</strong></p>



<p>Let&#8217;s divide our data into older and newer houses (built before and after 1980, respectively) and plot them against &#8216;<code>SalePrice</code>&#8216;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">older_house = X.YearBuilt &lt; 1980
plot = sns.scatterplot(x=X.OverallCond, y=train.SalePrice, hue=older_house)
legend = plot.legend_
legend.set_title("Built before 1980")
plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="539" height="382" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-51.png" alt="" class="wp-image-1420186" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-51.png 539w, https://blog.finxter.com/wp-content/uploads/2023/06/image-51-300x213.png 300w" sizes="auto, (max-width: 539px) 100vw, 539px" /></figure>
</div>


<p>Interesting! It appears that for newer houses, &#8216;<code>OverallCond</code>&#8216; generally receives a default value of 5. For older houses, however, the &#8216;<code>OverallCond</code>&#8216; rating seems to matter more.</p>



<p>To capitalize on this observation, we&#8217;ll create a new feature, &#8216;<code>HouseAge</code>&#8216;, to represent the age of the house, and another, &#8216;<code>AgeCond</code>&#8216;, to capture the interaction between &#8216;<code>HouseAge</code>&#8216; and &#8216;<code>OverallCond</code>&#8216;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def house_age(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond

train_and_test(LinearRegression(), X, y, house_age)
</pre>



<p>Incorporating these changes leads to a reduction in the RMSE from 0.1566 to 0.1562. While most experiments won&#8217;t bear fruit, and the successful ones may bring only minor improvements, persisting with this iterative process will gradually lead you to a well-performing model.</p>



<h2 class="wp-block-heading">Error Residuals for Feature Creation</h2>



<p class="has-global-color-8-background-color has-background"><strong>Error residuals</strong>, often simply called residuals, measure the gap between the actual and predicted values of a data point. In essence, a residual is the part of the outcome your model fails to explain. In linear regression, it&#8217;s calculated as e = y &#8211; ŷ, where &#8216;y&#8217; denotes the observed value and &#8216;ŷ&#8217; represents the predicted value from your model. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-61.png" alt="" class="wp-image-1420281" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-61.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-61-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p>A healthy model ideally has normally distributed and random residuals. By uncovering patterns within these errors, we can pinpoint the model&#8217;s blind spots, fueling us with novel feature creation ideas.</p>



<p>To illuminate this, let&#8217;s first establish a function to predict:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def generate_predictions(model, data_func=None):
    X_copy = X.copy()
    y_copy = y.copy()

    if data_func:
        data_func(X_copy, y_copy)

    # Reuse the same preprocessing step as train_and_test
    pipe = make_pipeline(
        impute_and_encode(X_copy),
        model
    )

    pipe.fit(X_copy, y_copy)
    predictions = pipe.predict(X_copy)
    return predictions
</pre>



<p>With predictions in hand, we calculate and visualize the residuals:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">predicted_prices = generate_predictions(LinearRegression(), house_age)
# Keep residuals as a DataFrame so we can filter on residuals.SalePrice later
residuals = y - predicted_prices.reshape(-1, 1)

plt.plot(range(len(y)), residuals, 'bo', alpha=.5)
plt.title('Error Residuals')
plt.xlabel('House Index')
plt.ylabel('Residual Value')
plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="576" height="408" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-55.png" alt="" class="wp-image-1420190" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-55.png 576w, https://blog.finxter.com/wp-content/uploads/2023/06/image-55-300x213.png 300w" sizes="auto, (max-width: 576px) 100vw, 576px" /></figure>
</div>


<p>The larger negative residuals represent cases where the model significantly overpredicted <code>SalePrice</code>. We can look at these houses and see if we can find some new information that will help the model predict lower prices. We are looking for something negative about these houses that the model didn&#8217;t see.</p>



<p>A quick scan reveals that these unpredictable homes often have an <code>OverallQual</code> rating below 5 and a <code>SaleCondition</code> that is not &#8220;Normal&#8221;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">train.loc[np.abs(residuals.SalePrice) > 0.4, ['SaleCondition', 'OverallQual', 'SalePrice']]</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="266" height="261" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-48.png" alt="" class="wp-image-1420182"/></figure>
</div>


<p>Utilizing the <code><a href="https://blog.finxter.com/pd-dataframe-groupby-a-simple-illustrated-guide/" data-type="post" data-id="340015" target="_blank" rel="noreferrer noopener">groupby</a></code> function of Pandas, we compare median prices for true versus false conditions, ideally spotting substantial price differences with a reasonable record count for each condition:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">train.groupby((train.OverallQual &lt; 5)).agg(dict(SalePrice=['median', 'count']))</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="188" height="131" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-54.png" alt="" class="wp-image-1420189"/></figure>
</div>


<p>We can easily modify the code to test similar conditions:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fltr = (train.SaleCondition=='Abnorml') &amp; (train.OverallQual &lt; 5)
train.groupby(fltr).agg(dict(SalePrice=['median', 'count']))</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="158" height="113" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-47.png" alt="" class="wp-image-1420181"/></figure>
</div>


<p>Now we can create a new feature and see if it helps the model.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def create_new_features(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond
    X['QuirkyCondition'] = (X.SaleCondition=='Abnorml') &amp; (X.OverallQual &lt; 5)

train_and_test(LinearRegression(), X, y, create_new_features)
</pre>



<p>The results? <strong>A tad better RMSE: mean RMSE with 5 folds: 0.1559</strong>. Another small victory. After every model modification, the residuals change, granting you another opportunity to analyze and iterate.</p>



<h2 class="wp-block-heading">Leveraging Integer Encoding for Categorical Features</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-62.png" alt="" class="wp-image-1420284" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-62.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-62-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p><strong>One Hot encoding</strong> is a popular technique for transforming categorical variables into binary features, especially when there&#8217;s no inherent order in the categories and their count is relatively small. </p>



<p>However, for ordinal features like <code>OverallQual</code>, where the categories follow a natural progression from &#8220;Poor&#8221; to &#8220;Excellent&#8221;, <strong>Integer (or Ordinal) Encoding</strong> would be more appropriate.</p>
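<p>As a side note, scikit-learn also ships an <code>OrdinalEncoder</code> that can apply a known category order directly. A minimal sketch, assuming the Ames quality abbreviations (&#8220;Po&#8221; through &#8220;Ex&#8221;) as the explicit ordering:</p>

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit ordering from worst to best quality (Po=Poor ... Ex=Excellent)
quality = np.array([['Po'], ['Fa'], ['TA'], ['Gd'], ['Ex'], ['TA']])
enc = OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex']])
encoded = enc.fit_transform(quality).ravel()
print(encoded)  # [0. 1. 2. 3. 4. 2.]
```

<p>This works when you already know the order; the target-based approach below learns an ordering from the data instead.</p>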



<p>Here&#8217;s how to perform Integer Encoding on a feature:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def find_category_mappings(df, variable, target):
    # generate a list of the labels, ordered by the median target value
    ordered_labels = df.groupby([variable])[target].median().sort_values().index

    # return the dictionary with mappings
    return {k: i for i, k in enumerate(ordered_labels)}

def integer_encode(df, feature):
    # mappings are always learned from the training set
    mapping = find_category_mappings(train, feature, 'SalePrice')
    df[feature] = df[feature].map(mapping)
</pre>



<p>The above functions rank feature values based on the median <code>SalePrice</code>, replacing them with their respective ranks. Consequently, unordered categorical features morph into meaningful ordinal features.</p>
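<p>To see what the mapping looks like, here is a quick sketch of the same logic on a toy dataset (the labels and prices are made up for illustration):</p>

```python
import pandas as pd

# Hypothetical mini-dataset: basement quality labels and sale prices
df = pd.DataFrame({
    'BsmtQual': ['Fa', 'TA', 'Gd', 'TA', 'Ex', 'Gd'],
    'SalePrice': [100, 140, 180, 150, 300, 200],
})

# Same logic as find_category_mappings: rank labels by median SalePrice
ordered_labels = df.groupby(['BsmtQual'])['SalePrice'].median().sort_values().index
mapping = {k: i for i, k in enumerate(ordered_labels)}
print(mapping)  # {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}
```

<p>The label with the lowest median price gets rank 0, so the encoded feature increases with the typical sale price.</p>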



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def ordinal_encode_features(X):
    integer_encode(X, 'BsmtQual')
    integer_encode(X, 'BsmtCond')
    # ... lots of others omitted for brevity ...
    integer_encode(X, 'GarageQual')
    integer_encode(X, 'GarageCond')</pre>



<p>Ordinal Encoding is particularly useful when a categorical feature has many unique values or when creating interaction terms with that feature. </p>



<p>The &#8216;<code>Neighborhood</code>&#8217; feature is an excellent case in point. A more affluent neighborhood might have distinctive preferences for various features, which we can capture by creating interaction terms, multiplying the integer-encoded &#8216;<code>Neighborhood</code>&#8217; field with those features.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def neighborhood_features(X):
    X['Hood2'] = X['Neighborhood'].values
    integer_encode(X, 'Neighborhood')
   
    # neighborhood interactions
    X['HoodQual'] = X.Neighborhood * X.OverallQual
    X['HoodQual3'] = X.Neighborhood * X.BsmtQual
    # ... [add the other interaction terms here] ...
    X['HoodRooms'] = X.Neighborhood * X.TotRmsAbvGrd
    X['HoodRooms2'] = X.GrLivArea * X.BedroomAbvGr


def data_prep(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond
    X['QuirkyCondition'] = (X.SaleCondition=='Abnorml') &amp; (X.OverallQual &lt; 5)
    ordinal_encode_features(X)
    neighborhood_features(X)


train_and_test(RidgeCV(), data_prep)
</pre>



<p>And the result? <strong>A significant improvement in the RMSE score: mean RMSE with 5 folds: 0.1380.</strong> </p>



<p>Note that we used the <code>RidgeCV</code> model this time. Ridge regression is suitable when your data exhibits multicollinearity (high correlations among predictor variables), and it can help mitigate overfitting. </p>



<p>Attempting the same with <code><a href="https://blog.finxter.com/python-linear-regression-1-liner/" data-type="post" data-id="1920" target="_blank" rel="noreferrer noopener">LinearRegression</a></code> resulted in an unsatisfactory outcome, indicating it&#8217;s time to explore more sophisticated models.</p>
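<p>To illustrate why ridge helps with multicollinearity, here is a small sketch on synthetic data (not the competition data): two nearly identical predictors share one signal, and ridge splits the coefficient between them instead of letting the estimates blow up:</p>

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

model = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
# The two coefficients sum to roughly 3, the true effect of the shared signal
print(model.coef_)
```

<p>Plain least squares in this setting can assign huge offsetting coefficients to the two copies; the ridge penalty keeps them small and stable.</p>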



<h2 class="wp-block-heading">Exploring Advanced Models and Transformers Using TPOT</h2>



<p class="has-global-color-8-background-color has-background"><strong>Tree-based Pipeline Optimization Tool (TPOT)</strong> is a Python library designed to automate the construction and optimization of machine learning pipelines. It uses genetic programming to ease the process of building complex models, especially beneficial for practitioners with limited machine learning expertise.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-63-1024x576.png" alt="" class="wp-image-1420285" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-63-1024x576.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-63-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-63-768x432.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-63.png 1122w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>TPOT treats the pipeline creation as a search problem, exploring through various data pre-processing steps, feature selection techniques, model selections, and hyperparameter choices, aiming to find the optimal pipeline that maximizes the performance on your dataset.</p>



<p>It&#8217;s worth noting that running TPOT might take some time, but the insights obtained from its suggestions can be valuable. Particularly, it provides initial values for model hyperparameters, which can offer a significant advantage during the hyperparameter tuning process.</p>



<p>The first step is to create a <code>TPOTRegressor</code> object:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from tpot import TPOTRegressor
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)</pre>



<p>The <code>TPOTRegressor</code> is designed specifically for regression tasks. </p>



<p>The <code>generations</code> parameter indicates the number of rounds the algorithm should run to find the best pipeline; a higher number typically implies a slower but potentially more accurate outcome. </p>



<p><code>population_size</code> sets how many pipelines the algorithm keeps in each generation, and <code>verbosity</code> controls the level of output information. </p>



<p>Keep in mind that running TPOT can be time-consuming, especially as it&#8217;s applied across five cross-validation folds in the <code>train_and_test</code> function.</p>



<p>For instance, here is a TPOT recommendation:</p>



<p><strong>Best pipeline:</strong> <code>ExtraTreesRegressor(LassoLarsCV(input_matrix, normalize=False), bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100)</code></p>



<p>Interpreting this, you start from the center and work outward. Thus, TPOT suggests a pipeline comprising two steps:</p>



<ol class="wp-block-list">
<li><code>LassoLarsCV(input_matrix, normalize=False)</code></li>



<li><code>ExtraTreesRegressor(bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100)</code></li>
</ol>



<p>However, there&#8217;s a caveat. </p>



<p>A pipeline can only end with a machine learning model, and all previous steps must be transformers. Hence, not all suggestions directly fit the standard Scikit-Learn pipeline structure. </p>



<p>What if TPOT recommends two machine learning models in its recommended pipeline? </p>



<p>You can stack them. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Stacking Machine Learning Models</h2>



<p class="has-global-color-8-background-color has-background">Stacking is a technique where predictions of individual models are used as input for a final model (also known as meta-learner) to make a final prediction. Scikit-Learn offers a <code>StackingRegressor</code> for this purpose.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="377" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-64.png" alt="" class="wp-image-1420289" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-64.png 377w, https://blog.finxter.com/wp-content/uploads/2023/06/image-64-179x300.png 179w" sizes="auto, (max-width: 377px) 100vw, 377px" /></figure>
</div>


<p>To use the <code>StackingRegressor</code>, we first need to initialize the base models and the final model.</p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.ensemble import ExtraTreesRegressor, StackingRegressor
from sklearn.linear_model import LassoLarsCV, LinearRegression

# Initialize the base models
base_models = [
    ('lassolarscv', LassoLarsCV(normalize=False)),
    ('extratrees', ExtraTreesRegressor(bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100))
]


# Initialize the final model
final_model = LinearRegression()


# Create the stacking regressor
stack2 = StackingRegressor(
    estimators=base_models,
    final_estimator=final_model
)


train_and_test(stack2, data_prep)
</pre>



<p>Using this model, the RMSE has dropped to 0.1294, a pretty significant improvement.</p>



<h2 class="wp-block-heading">Adding Scalers and Feature Selectors to the Pipeline</h2>



<p>Machine learning pipelines can incorporate <strong>scalers </strong>and <strong>feature selectors</strong> for improved results. </p>



<ul class="wp-block-list">
<li><strong>Scalers </strong>transform the data to fit within a certain scale like standard deviation or minimum and maximum values, improving the performance of some machine learning models. </li>



<li><strong>Feature selectors</strong>, on the other hand, can be used to reduce the dimensionality of the data by selecting the most important features.</li>
</ul>
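<p>As a quick sketch of a feature selector on toy data (not the competition dataset): <code>VarianceThreshold</code> drops columns whose variance falls below a cutoff.</p>

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The second column is nearly constant, so it carries almost no information
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.01],
              [4.0, 0.0]])

selector = VarianceThreshold(threshold=0.05)
X_sel = selector.fit_transform(X)
print(X_sel.shape)  # (4, 1) -- only the first column survives
```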


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-65.png" alt="" class="wp-image-1420292" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-65.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-65-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p>Here is a recommendation from <code>tpot</code> that includes a scaler:</p>



<p><strong>Best Pipeline:</strong> <code>XGBRegressor(ElasticNetCV(RobustScaler(input_matrix), l1_ratio=0.1, tol=0.001), learning_rate=0.1, max_depth=9, min_child_weight=6, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.35000000000000003, verbosity=0)</code></p>



<p>And here’s one that recommends a feature selector:</p>



<p><strong>Best pipeline</strong>: <code>RandomForestRegressor(VarianceThreshold(LassoLarsCV(input_matrix, normalize=False), 0.028), bootstrap=False, max_features=0.4, min_samples_leaf=9, min_samples_split=19, n_estimators=100)</code></p>



<p>Let’s try out these ideas.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold

def train_and_test(model, data_func=None):
    # use copies so the original data isn't changed
    X_copy = X.copy()
    y_copy = y.copy()

    if data_func:
        data_func(X_copy, y_copy)

    pipe = make_pipeline(
        get_preprocessor(X_copy),
        RobustScaler(),
        VarianceThreshold(.028),
        model
    )
    pipe.fit(X_copy, y_copy)
    evaluate_model(pipe, X_copy, y_copy)

train_and_test(stack2, data_prep)
</pre>



<p>Another improvement!&nbsp;</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This blog post has delved into several powerful tools and strategies that I leveraged to improve my ranking in the Kaggle House Prices competition. Here, we revisited:</p>



<ol class="wp-block-list">
<li>The use of Pipelines and a robust &#8220;<code>train_and_test</code>&#8221; function to streamline the model training and evaluation process, fostering cleaner, more manageable code.</li>



<li>The exploration of <strong>Pandas and Seaborn</strong> libraries for brainstorming and creating new features. Data visualization, summary statistics, and feature engineering are crucial in building a comprehensive understanding of your dataset and in finding innovative ways to extract more predictive power from it.</li>



<li>The deployment of <strong>TPOT, a Python Automated Machine Learning</strong> tool that optimizes machine learning pipelines using genetic programming. It&#8217;s a great resource to generate ideas for models, transformers, and pipeline configurations.</li>
</ol>



<p>The key is to foster a productive cycle of idea generation and rapid testing. Ensuring a clean and organized codebase can significantly ease this process. It might be a bit challenging initially, as it was for me, especially when dealing with bloated notebooks that seem impossible to debug or optimize. </p>



<p>However, with perseverance and the right approach, you can turn this into an enjoyable and highly rewarding journey. </p>



<p>Over time, you will find yourself becoming more adept at navigating through these challenges and devising effective solutions, leading to better results and a deeper understanding of machine learning concepts.</p>



<p>Also check out my other article you&#8217;ll probably enjoy:</p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/heatmaps-with-seaborn/" data-type="URL" data-id="https://blog.finxter.com/heatmaps-with-seaborn/" target="_blank" rel="noreferrer noopener">How I Scattered My Fat with Python – Scraping and Analyzing My Nutrition Data From Cronometer.com</a></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://blog.finxter.com/heatmaps-with-seaborn/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="818" height="563" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-56.png" alt="" class="wp-image-1420233" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-56.png 818w, https://blog.finxter.com/wp-content/uploads/2023/06/image-56-300x206.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-56-768x529.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></a></figure>
</div><p>The post <a href="https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/">How I Cracked the Top 100 in the Kaggle House Prices Competition</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com</title>
		<link>https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Thu, 23 Mar 2023 09:16:08 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Health]]></category>
		<category><![CDATA[Pandas Library]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1236413</guid>

					<description><![CDATA[<p>From April 1st through August 14th, I tracked everything I ate on cronometer.com as part of a weight loss challenge. Overall I lost almost 25 pounds at a rate of 1.2 pounds per week. I always wondered what I could learn if I could scrape that data and get it into a Jupyter Notebook. In ... <a title="How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com" class="read-more" href="https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/" aria-label="Read more about How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/">How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>From April 1st through August 14th, I tracked everything I ate on <a href="https://cronometer.com/" data-type="URL" data-id="https://cronometer.com/" target="_blank" rel="noreferrer noopener">cronometer.com</a> as part of a weight loss challenge. Overall I lost almost 25 pounds at a rate of 1.2 pounds per week. </p>



<p>I always wondered what I could learn if I could <strong>scrape that data and get it into a Jupyter Notebook</strong>. In this article, I will analyze the data and hopefully demonstrate the value of scraping and analyzing personal data.</p>



<h2 class="wp-block-heading">Why cronometer.com is useful for tracking dietary information</h2>



<p>Cronometer allows you to track your foods, biometric data, exercise, and notes. It will calculate calories and a whole host of nutritional information related to vitamins, minerals, macronutrients, amino acids, etc. It will even allow you to track important nutrient ratios such as Omega-6 to <a href="https://weirdsmoothies.com/pimp-your-smoothie-dr-gregers-daily-dozens/" data-type="URL" data-id="https://weirdsmoothies.com/pimp-your-smoothie-dr-gregers-daily-dozens/" target="_blank" rel="noreferrer noopener">Omega-3</a>, Potassium to Sodium, and Calcium to Magnesium. </p>



<p>Here is a sample of the diary page:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="410" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-250-1024x410.png" alt="" class="wp-image-1236418" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-250-1024x410.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250-300x120.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250-768x308.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250-1536x615.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>A handy summary shows calories consumed, burned, and remaining:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="216" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-251-1024x216.png" alt="" class="wp-image-1236419" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-251-1024x216.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-251-300x63.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-251-768x162.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-251.png 1205w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Calories burned are based on your <strong>Basal Metabolic Rate</strong>, an estimate of calories burned based on your average daily activity level and the exercise you entered. On this day, I had 387 calories remaining, which means I had a calorie deficit of 387, which is a good day if you’re trying to lose weight. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4aa.png" alt="💪" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>The diary also displays a great deal of nutrient information, including vitamins, minerals, protein including amino acids, carbohydrates, and fats.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="736" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-252-1024x736.png" alt="" class="wp-image-1236422" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-252-1024x736.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-252-300x216.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-252-768x552.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-252.png 1177w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>It shows the overall nutrition information for the day as a whole, and for each item in the food diary. A wealth of information is just sitting there, waiting to be harvested.</p>



<h2 class="wp-block-heading">Tools used to scrape the data</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="907" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-271.png" alt="" class="wp-image-1236523" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-271.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-271-300x300.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-271-150x150.png 150w, https://blog.finxter.com/wp-content/uploads/2023/03/image-271-768x768.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>To scrape data from an interactive site like Cronometer, you need a tool that can automate interacting with the site. </p>



<p>The tool I used for automation was <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-get-the-text-with-selenium-in-python/" data-type="post" data-id="36873" target="_blank">Selenium</a>. </p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Selenium</strong> was great for logging in, navigating the calendar to move from day to day, and right-clicking items in the food diary to get to the detailed information. However, I used the <code><a href="https://blog.finxter.com/how-to-read-html-tables-with-pandas/" data-type="post" data-id="590893">read_html()</a></code> function from the Pandas module to extract the data from the web page. </p>



<p>Pandas was also the main tool for the data analysis with some graphs in Seaborn. The full code can be found on the GitHub page <a rel="noreferrer noopener" href="https://github.com/finxter/DietChallenge" data-type="URL" data-id="https://github.com/finxter/DietChallenge" target="_blank">here</a>.</p>
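<p>For a feel of how <code>read_html()</code> works, here is a tiny self-contained round trip (no live site involved): write a DataFrame out as an HTML table, then parse it back. The nutrient values are made up, and <code>read_html()</code> needs an HTML parser such as <code>lxml</code> installed.</p>

```python
import pandas as pd
from io import StringIO

# Write a small DataFrame out as an HTML table...
df = pd.DataFrame({'Nutrient': ['Protein', 'Fiber'], 'Amount': [120, 35]})
html_text = df.to_html(index=False)

# ...then parse it back: read_html returns one DataFrame per table found
tables = pd.read_html(StringIO(html_text))
print(tables[0].shape)  # (2, 2)
```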



<h2 class="wp-block-heading">Working with Selenium</h2>



<p>First the imports.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys</pre>



<p>That may seem like a lot of imports, but they are all necessary. The central object is the web driver, which opens a browser of your choice and automates it. A nice side effect is that you can watch the browser while the code is running, and inspect it afterward. I chose Firefox simply because I found it the easiest to work with.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">URL = 'https://cronometer.com/login/'

def get_driver(url):
    driver = webdriver.Firefox()
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(5)
    set_viewport_size(driver, 1920, 3200)

    return driver</pre>



<p>Let’s look at <code>driver.implicitly_wait(5)</code>.  </p>



<p>The <code>implicitly_wait</code> function is used to set a default time for the driver to wait before throwing a <code>NoSuchElementException</code>. </p>



<p>Modern websites rely on code that runs before all the elements are loaded. If your Selenium code gets ahead of the code behind the web page, you can be hit with the <code>NoSuchElementException</code>. This default waiting time helps avoid that problem. However, there will be times when we want to use explicit waits as well.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="787" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-272.png" alt="" class="wp-image-1236526" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-272.png 787w, https://blog.finxter.com/wp-content/uploads/2023/03/image-272-252x300.png 252w, https://blog.finxter.com/wp-content/uploads/2023/03/image-272-768x915.png 768w" sizes="auto, (max-width: 787px) 100vw, 787px" /></figure>
</div>


<p>Now a few words about the <code>set_viewport_size</code> function, but first I will take a deep breath and spend a few moments in my happy place. </p>



<p>The viewport refers to the visible area of the web page in your browser. So if you try to interact with an element that is not in the viewport, you will get an error. </p>



<p>My first attempt to resolve this was to scroll to each element and move to it before trying to interact with it. And this worked, most of the time. But it would occasionally fail, on a different element each time. Very frustrating! </p>



<p>But eventually, I discovered that you can set the size of the viewport. By setting the size large enough, the problem was resolved. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def set_viewport_size(driver, width, height):
    window_size = driver.execute_script("""
        return [window.outerWidth - window.innerWidth + arguments[0],
          window.outerHeight - window.innerHeight + arguments[1]];
        """, width, height)
    driver.set_window_size(*window_size)


set_viewport_size(driver, 1920, 3200)</pre>



<p>Notice that with <code>driver.execute_script</code> we can run <a href="https://blog.finxter.com/javascript-developer-income-and-opportunity/" data-type="post" data-id="191233" target="_blank" rel="noreferrer noopener">Javascript</a> on the browser. This can be very useful. </p>



<h2 class="wp-block-heading">Logging in to a web site with Selenium</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def log_in(driver):
    user_name = driver.find_element(By.NAME, 'username')
    password = driver.find_element(By.NAME, 'password')
    login = driver.find_element(By.ID, 'login-button')


    user_name.send_keys('email@email.com')
    password.send_keys('***********************')
    login.click()


    # go to the diary page
    click_button(driver, DIARY_XPATH)
</pre>



<p>The <code>By</code> object is used to tell the driver how to find the element you want. </p>



<p>If you are lucky, the element can be uniquely defined by a name or an id as in this case. Filling in a form element is easy. You can just use the <code>element.send_keys</code> method.</p>



<p>Clicking the login button was a bit more complicated, because I needed an explicit wait to make extra sure the element was present before trying to click it.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">DIARY_XPATH = '//span[contains(text(), "Diary")]'

def click_button(driver, button_xpath):
    try:
        button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, button_xpath)))
    except Exception as e:
        print('error trying to click button', button_xpath)
        print(e)
        raise  # without the element there is nothing to click

    webdriver.ActionChains(driver).move_to_element(button).click(button).perform()
</pre>



<p>The <code>ActionChains</code> object allows you to chain multiple actions to an element in one statement. In this case, I move to the element before clicking it.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="607" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-273.png" alt="" class="wp-image-1236529" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-273.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-273-300x201.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-273-768x514.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>What is an <code>XPATH</code>? It’s a web scraper&#8217;s best friend and worst nightmare. From ChatGPT:</p>



<p><em>XPath is a query language used to traverse XML and HTML documents. In Selenium, XPath can be used to identify elements on a webpage by navigating the document&#8217;s hierarchy of nodes.</em></p>



<p><em>XPath is based on a set of rules for traversing the document tree. The tree consists of nodes, which can be either elements, attributes, text, or comments. XPath expressions are used to select nodes or sets of nodes in the tree, based on their relationship to other nodes.</em></p>



<p>In our example <code>//span[contains(text(), 'Diary')]</code> can be unpacked:</p>



<ul class="wp-block-list">
<li><code>//span</code> returns all span elements in the document, regardless of location</li>



<li>Brackets are used to filter elements</li>



<li>The <code>text()</code> function returns the text associated with the element</li>



<li>The <code>contains(text(), 'Diary')</code> expression means look for <code>'Diary'</code> anywhere within the text</li>



<li>Putting it all together, <code>//span[contains(text(), 'Diary')]</code> means give me all span elements that have <code>'Diary'</code> anywhere within their text. Luckily, in this case, there is only one such element</li>
</ul>
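<p>You can also test an XPath expression outside the browser. A minimal sketch with <code>lxml</code> (assuming it is installed), using a made-up HTML snippet:</p>

```python
from lxml import html

doc = html.fromstring("""
<div>
  <span>Settings</span>
  <span>Food Diary</span>
</div>
""")

# Same expression style as above: spans whose text contains 'Diary'
matches = doc.xpath("//span[contains(text(), 'Diary')]")
print([m.text for m in matches])  # ['Food Diary']
```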



<p>In our example, the <code>XPATH</code> is pretty short and identifies only the desired element. So how can <code>XPATH</code> become a nightmare? It did when I tried to create an <code>XPATH</code> to identify only the vitamin elements on a page. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="581" height="451" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-253.png" alt="" class="wp-image-1236432" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-253.png 581w, https://blog.finxter.com/wp-content/uploads/2023/03/image-253-300x233.png 300w" sizes="auto, (max-width: 581px) 100vw, 581px" /></figure>
</div>


<p>Here the <code>XPATH</code> quickly becomes complicated. And I was able to find an expression that effectively filtered only the vitamins for one particular record. However, after running the web scraping process, which takes quite a long time, I found a few records where the data was just wrong. </p>



<p>If you right-click on the web page and choose to inspect, it will bring up the developer tools window. Then you can hit <code>control-f</code> to bring up a search box. This is how you can test your <code>XPATH</code> to see what it returns. </p>



<p>For example:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="594" height="225" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-254.png" alt="" class="wp-image-1236433" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-254.png 594w, https://blog.finxter.com/wp-content/uploads/2023/03/image-254-300x114.png 300w" sizes="auto, (max-width: 594px) 100vw, 594px" /></figure>
</div>


<p>Here I am searching for all HTML elements in the DOM. </p>



<p>Why do I get back 5 elements? Shouldn’t there be just one? It turns out there are entire HTML documents embedded within the DOM. And their data doesn’t necessarily match what you see on the screen. </p>



<p>And sometimes the <code>XPATH</code> expression was pulling data that didn’t match what was displayed. This means the data was wrong. </p>



<p>Often these documents were embedded within iFrame elements.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="791" height="82" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-255.png" alt="" class="wp-image-1236436" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-255.png 791w, https://blog.finxter.com/wp-content/uploads/2023/03/image-255-300x31.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-255-768x80.png 768w" sizes="auto, (max-width: 791px) 100vw, 791px" /></figure>
</div>


<p>I tried filtering out the iFrames, but nothing I did worked 100% of the time. So how did I end up scraping the actual data? With my old friend Pandas.</p>



<h2 class="wp-block-heading">Scraping data with Pandas</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-274.png" alt="" class="wp-image-1236530" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-274.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-274-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-274-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>Pandas has a <code>read_html</code> method that is very powerful and simple to use. All you have to do is feed it <code>driver.page_source</code> and it returns a list of DataFrames. This is very convenient because DataFrames are what I used for data cleaning and data analysis. </p>



<p>The <code><a rel="noreferrer noopener" href="https://blog.finxter.com/reading-and-writing-html-with-pandas/" data-type="post" data-id="37082" target="_blank">read_html()</a></code> method searches for data in tables and is smart enough to only give you the desired data. Fortunately, all the data I need is stored in tables. </p>



<p>For example, on the diary page, the daily USRDA data is stored in 6 tables under the headers: </p>



<ul class="wp-block-list">
<li>General, </li>



<li>Carbohydrates, </li>



<li>Lipids, </li>



<li>Protein, </li>



<li>Vitamins and </li>



<li>Minerals.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="725" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-256-1024x725.png" alt="" class="wp-image-1236442" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-256-1024x725.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-256-300x213.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-256-768x544.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-256.png 1180w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>First step is to get the list of DataFrames:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">tables = pd.read_html(driver.page_source)
print(f'{len(tables)} tables found')
print('shapes: ', end='')
for i in range(len(tables)):
    print(tables[i].shape, end=' ')
</pre>



<p>Output:</p>



<pre class="wp-block-preformatted"><code>10 tables found
shapes: (26, 8) (5, 4) (6, 4) (9, 4) (13, 4) (13, 4) (11, 4) (10, 7) (1, 5) (7, 7)</code></pre>






<p>The data we want is in the tables at indices 1 through 6. So we just need to <a href="https://blog.finxter.com/how-does-pandas-concat-work/" data-type="post" data-id="17172" target="_blank" rel="noreferrer noopener">concatenate the tables</a> and filter out the data we don’t want.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">nutrients = pd.concat(tables[1:7])   # the tables at indices 1-6 hold the nutrient data
nutrients.columns = ['item', 'quantity', 'units', 'percent_rda']
nutrients = nutrients.dropna()
nutrients = nutrients[nutrients.percent_rda.str.contains('%')]
nutrients.head()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="282" height="168" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-257.png" alt="" class="wp-image-1236447"/></figure>
</div>


<p>By default <code>pd.concat</code> stacks DataFrames vertically. The <code><a href="https://blog.finxter.com/pandas-dataframe-dropna-method/" data-type="post" data-id="343814" target="_blank" rel="noreferrer noopener">dropna()</a></code> method removes rows that have empty values. </p>



<p>The next line uses <a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-dataframe-indexing/" data-type="post" data-id="64801" target="_blank">boolean indexing</a> to filter the nutrients DataFrame to include rows where the value in the <code>percent_rda</code> column contains a <code>%</code>. This filters out nutrients like alcohol where there is no RDA.</p>
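<p>Here is a minimal reproduction of that cleaning pipeline on a couple of hand-made tables (the values are invented, purely for illustration):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

# Two toy tables standing in for the per-section nutrient tables
t1 = pd.DataFrame({'item': ['Vitamin C', 'Alcohol'],
                   'quantity': [82.0, 0.0],
                   'units': ['mg', 'g'],
                   'percent_rda': ['91%', 'No Targets']})
t2 = pd.DataFrame({'item': ['Iron', None],
                   'quantity': [9.5, None],
                   'units': ['mg', None],
                   'percent_rda': ['53%', None]})

nutrients = pd.concat([t1, t2])    # stack vertically (the default axis=0)
nutrients = nutrients.dropna()     # drop rows with missing values
nutrients = nutrients[nutrients.percent_rda.str.contains('%')]  # keep rows that have an RDA
print(nutrients['item'].tolist())
</pre>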



<p>Pandas is such a powerful and versatile tool for working with data in Python. So I was delighted to find out it can also scrape data. </p>



<p>However, I would like to find something to handle the automation that is a little simpler to work with than Selenium. It does get the job done; perhaps I just need more experience.</p>



<h2 class="wp-block-heading">Right-clicking with Selenium</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="625" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-275.png" alt="" class="wp-image-1236534" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-275.png 625w, https://blog.finxter.com/wp-content/uploads/2023/03/image-275-200x300.png 200w" sizes="auto, (max-width: 625px) 100vw, 625px" /></figure>
</div>


<p>The main diary page has nutrient information for the day as a whole, but you can get nutrient information for each item in the food diary by right-clicking the item and choosing ‘details’ in the pop-up menu.</p>



<p>The first step is to find a way to access the food diary rows directly. For that we return to our old friend/nemesis the <code>XPATH</code>. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">FOOD_DIARY_XPATH = "//table[@class='crono-table']//td[@class='diary-time']/parent::tr"</pre>



<p>Unpacking the expression:</p>



<ul class="wp-block-list">
<li><code>//table</code> means give me all tables anywhere in the document</li>



<li><code>[@class='crono-table']</code> means of those tables only give me the ones that contain the class <code>'crono-table'</code></li>



<li><code>//td[@class='diary-time']</code> means give me <code>td</code> elements that fall anywhere under the tables we got from the previous step, but only if they contain the class <code>diary-time</code></li>



<li><code>/parent::tr</code> means: <em>Ok, now let&#8217;s go up one level to the parent but only if it is a <code>tr</code> element</em>.</li>
</ul>



<p>So we can see the <code>XPATH</code> can pack a great deal of filtering logic into one dense compact statement. It’s a lot like <a href="https://blog.finxter.com/python-regex/" data-type="post" data-id="6210" target="_blank" rel="noreferrer noopener">regular expressions</a> in that regard.</p>



<p>Likewise, we need an <code>XPATH</code> expression for the details row in the pop-up menu:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">VIEW_EDIT_XPATH = "//*[contains(text(), 'View/Edit')]"</pre>



<p>Here the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-regex-quantifiers-question-mark-vs-plus-vs-asterisk-differences/" data-type="post" data-id="6915" target="_blank">asterisk</a> <code>*</code> is a wildcard. So this expression gives us any element that contains the text “View/Edit”. </p>



<p>Here is the code to get all the food diary elements into a <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">wait = WebDriverWait(driver, 20)
diary = []
diary_elements = wait.until(EC.visibility_of_all_elements_located((By.XPATH, FOOD_DIARY_XPATH)))
diary_elements = [wait.until(EC.element_to_be_clickable(e)) for e in diary_elements]
</pre>



<p><code>WebDriverWait</code> defines an explicit wait, meaning it waits until a condition is met rather than for a fixed amount of time. </p>



<p>We told it to wait a maximum of 20 seconds for this condition to be met. </p>



<p>The first condition we look for is that all elements can be located by Selenium. If you don’t wait, your code will sometimes get ahead of the page the driver is trying to load, and you will get an error.</p>



<p>With the last line of code, I am using a <a rel="noreferrer noopener" href="https://blog.finxter.com/list-comprehension/" data-type="post" data-id="1171" target="_blank">list comprehension</a> to make sure each diary element is actually clickable before the element is added to the final list. It is possible for an element to be visible but not yet clickable. This will lead to an error when we try to right-click the element.</p>
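<p>Under the hood, an explicit wait is just a polling loop. Here is a simplified, Selenium-free sketch of the idea (this helper is my approximation, not Selenium’s actual implementation):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import time

def wait_until(condition, timeout=20, poll_frequency=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires.

    A rough stand-in for what WebDriverWait(driver, 20).until(...) does.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError('condition not met within timeout')
        time.sleep(poll_frequency)

# A condition that only becomes truthy on the third poll
state = {'polls': 0}
def ready():
    state['polls'] += 1
    return state['polls'] if state['polls'] >= 3 else None

value = wait_until(ready, timeout=5, poll_frequency=0.01)
print(value)   # 3
</pre>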



<h2 class="wp-block-heading">Working with the calendar in cronometer</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="625" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-276.png" alt="" class="wp-image-1236535" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-276.png 625w, https://blog.finxter.com/wp-content/uploads/2023/03/image-276-200x300.png 200w" sizes="auto, (max-width: 625px) 100vw, 625px" /></figure>
</div>


<p>This was a fun puzzle to solve: how do you get to April 1, 2021 from today, using controls that jump back a year, step back or forward a month, and then locate the first day of the month on the calendar? </p>



<p>Here is what the calendar looks like:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="328" height="456" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-258.png" alt="" class="wp-image-1236453" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-258.png 328w, https://blog.finxter.com/wp-content/uploads/2023/03/image-258-216x300.png 216w" sizes="auto, (max-width: 328px) 100vw, 328px" /></figure>
</div>


<p>The first step is to get to the right year and month:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">last_year_xpath = "//div[contains(text(), '«')]"
next_month_xpath = "//div[contains(text(), '›')]"
last_month_xpath = "//div[contains(text(), '‹')]"


target_date = datetime.strptime(target_date, '%Y-%m-%d')
today = datetime.today()


last_year_button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, last_year_xpath)))
next_month_button = driver.find_element(By.XPATH, next_month_xpath)
last_month_button = driver.find_element(By.XPATH, last_month_xpath)


for _ in range(today.year - target_date.year):  
    ac = webdriver.ActionChains(driver)
    ac.move_to_element(last_year_button).click(last_year_button).perform()
    time.sleep(2)


if target_date.month > today.month:
    for _ in range(target_date.month - today.month):
        next_month_button.click()
        time.sleep(2)
else:
    for _ in range(today.month - target_date.month):
        last_month_button.click()
        time.sleep(2)
</pre>
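<p>The number of clicks can be worked out up front with a little datetime arithmetic. This helper is hypothetical (it wasn’t part of my script), but it captures the navigation logic of the loops above:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from datetime import datetime

def calendar_clicks(today, target):
    """How many «-clicks (years back) and month clicks to reach the target.

    Returns (years_back, month_delta); a positive month_delta means 'next
    month' clicks, a negative one means 'last month' clicks.
    """
    years_back = today.year - target.year
    month_delta = target.month - today.month
    return years_back, month_delta

# From October 2023 back to April 2021: 2 year-clicks, then 6 'last month' clicks
print(calendar_clicks(datetime(2023, 10, 19), datetime(2021, 4, 1)))
</pre>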



<p>Next, find the control for the day. </p>



<p>There are always 42 day cells on the calendar. For this particular month, that works out to 3 from the previous month, 31 for the current month, and 8 from the next month. </p>



<p>We want the calendar element with the text “1”, but only the first one. The day controls all have a unique id starting at 100. The problem is the id for the first day of the month can vary.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">first_day_id = 99   # day control ids start at 100; the loop increments first
first_day_text = ''
while first_day_text != '1':
    first_day_id += 1
    first_day_css = f"td#calendar-date-{first_day_id}"
    first_day_div = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, first_day_css))
    )
    first_day_text = first_day_div.text


first_day_div.click()
</pre>



<p>Then after scraping data for day 1, I just have to click the <code>tomorrow</code> button on the calendar and do it again until I finally reach August 15, 2021.</p>



<p>Selenium was a bit frustrating until I got the hang of it. However, once I increased the viewport size and used explicit waits, it got the job done for website automation. The <code>read_html</code> function from Pandas turned out to be a lifesaver for doing the actual scraping of the data. </p>



<h2 class="wp-block-heading">Data Analysis</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-277.png" alt="" class="wp-image-1236540" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-277.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-277-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-277-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>Now for the fun part. After spending so much time scraping the data, it&#8217;s time to dive into some analysis! </p>



<p>Overall I lost .17 pounds per day with a standard deviation of .91 pounds. This lasted for 131 days for a total of 24.2 pounds lost. </p>



<p>Here is a <a href="https://blog.finxter.com/matplotlib-scatter-plot/" data-type="post" data-id="5590" target="_blank" rel="noreferrer noopener">scatter plot</a> of Weight vs Day of Challenge including a regression line:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="818" height="563" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-259.png" alt="" class="wp-image-1236456" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-259.png 818w, https://blog.finxter.com/wp-content/uploads/2023/03/image-259-300x206.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-259-768x529.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></figure>
</div>


<p>Wow, that is surprisingly linear! I always thought weight loss was supposed to be fast initially, then taper off. </p>



<p>The R-squared value of .98 is very high. R-squared measures how well the regression line fits the data. Values range between 0 and 1. </p>



<p>An R-squared of 0 would indicate the regression line doesn’t fit the data at all. An R-squared of 1 indicates the regression line fits the data perfectly. </p>
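<p>For intuition, R-squared is just 1 minus the ratio of residual variance to total variance. Here is a tiny pure-Python version (no sklearn needed) on made-up numbers:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot, matching sklearn's r2_score."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# A perfect fit scores 1.0; predicting the mean everywhere scores 0.0
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0
</pre>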



<p>Another interpretation is 98% of the variation of weight can be explained by the day on the program. In other words, the plan worked like a charm! Slow and steady wins the race.</p>



<p>Here is the code for the graph above. I used the <code>LinearRegression</code> class from the <code><a href="https://blog.finxter.com/python-linear-regression-1-liner/" data-type="post" data-id="1920" target="_blank" rel="noreferrer noopener">sklearn</a></code> module to create the regression line. Unfortunately, to get <code>LinearRegression</code> to work for simple regression with only one feature we have to reshape the data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns

def plot_regression(data, feature, target, title):
  # sklearn expects a 2d matrix so we have to reshape pandas series
  # an array of size n is reshaped into a matrix with n rows and 1 column
  y = data[target].values.reshape(-1, 1)
  X = data[feature].values.reshape(-1, 1)
  model = LinearRegression()
  model.fit(X, y)


  # get slope and intercept from model
  slope = model.coef_[0][0]
  intercept = model.intercept_[0]


  # use slope and intercept to create predictions
  weight_pred = intercept + slope * X.reshape(-1)


  # use R2 score to compare predictions to true values
  r2 = r2_score(data[target], weight_pred)


  # plot
  plt.figure(figsize=(12,8))
  sns.scatterplot(x=feature, y=target, data=data)
  plt.plot(X.reshape(-1), weight_pred, linewidth=1, color='r', label=f'y={slope:.2f} * x + {intercept:.1f}')


  # add a second row to the title to display R2
  plt.title(title + f'\nr2 = {r2:.2f} ')
</pre>



<p>The fact that the mean daily weight loss is only .17 pounds with a relatively large standard deviation of .91 pounds leads to some short-term results that can be quite frustrating. </p>



<p>For example, here is a two-week stretch where it felt like nothing was working:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="831" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-260.png" alt="" class="wp-image-1236461" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-260.png 831w, https://blog.finxter.com/wp-content/uploads/2023/03/image-260-300x195.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-260-768x500.png 768w" sizes="auto, (max-width: 831px) 100vw, 831px" /></figure>
</div>


<p>For comparison, here is a two-week stretch where everything seemed easy:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="831" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-261.png" alt="" class="wp-image-1236463" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-261.png 831w, https://blog.finxter.com/wp-content/uploads/2023/03/image-261-300x195.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-261-768x500.png 768w" sizes="auto, (max-width: 831px) 100vw, 831px" /></figure>
</div>


<p>So slow and steady may win the race, but it can often feel like losing. The trick is to have faith in the plan and keep on truckin&#8217;.</p>



<p>We can use a histogram to look at the distribution of weekly weight loss amounts:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="809" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-262.png" alt="" class="wp-image-1236464" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-262.png 809w, https://blog.finxter.com/wp-content/uploads/2023/03/image-262-300x201.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-262-768x514.png 768w" sizes="auto, (max-width: 809px) 100vw, 809px" /></figure>
</div>


<p>More good weeks than bad, and the best week dominates the worst week in absolute value: 3.5 pounds lost vs 1.5 pounds gained. There were enough positive results to stay motivated.</p>



<p>What if I repeated this challenge many times? What would the range of values for average weekly weight loss look like? </p>



<p>I can’t very well replicate the experiment 1,000 times, but I can estimate a 95% confidence interval using the bootstrap method. </p>



<p>This uses resampling with replacement to generate hypothetical samples, which can be used to create a confidence interval. Because we are resampling with replacement, some values can occur more than once in a given sample and others not at all. </p>



<p>This means we can generate samples from our data that are different from each other but still pulled from the same original data.</p>
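<p>The bootstrap itself is only a few lines: resample with replacement many times, take the mean of each resample, and read the interval off the percentiles. Here is a sketch using only the standard library, with invented weekly-loss numbers and a fixed seed:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import random
import statistics

def bootstrap_ci(data, n_resamples=1000, alpha=0.05, seed=42):
    """95% bootstrap confidence interval for the mean (percentile method)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))  # resample WITH replacement
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Invented weekly weight-loss values, just to exercise the function
weekly_loss = [1.4, 0.8, 1.9, 1.1, -0.3, 2.2, 1.3, 0.9, 1.6, 0.5]
low, high = bootstrap_ci(weekly_loss)
print(round(low, 2), round(high, 2))
</pre>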


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="818" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-263.png" alt="" class="wp-image-1236465" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-263.png 818w, https://blog.finxter.com/wp-content/uploads/2023/03/image-263-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-263-768x508.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></figure>
</div>


<p>Assuming the factors behind my current data hold, the bootstrap gives a 95% confidence interval suggesting that, if I replicated this experiment, I would lose somewhere from a pound to almost a pound and a half a week. </p>



<p>This also matches my previous experience. In previous weight loss challenges, I lost weight at a little over a pound a week. The fancy bootstrap method just makes it official. </p>



<h2 class="wp-block-heading">Looking at total calories over time</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-278.png" alt="" class="wp-image-1236541" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-278.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-278-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-278-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>My daily goal was to hit a caloric deficit of at least 200 calories. Luckily cronometer will help you calculate an estimate of the number of calories you burn on a typical day. </p>



<p>It will estimate your <strong>Basal Metabolic Rate</strong> and how many calories you burn each day through activity. For me, the total number is 2218 calories per day. </p>



<p>If I eat this amount, I should maintain my weight. If I consistently eat less, I should lose weight. 2000 was a good round number to try and hit each day. So how did I do?</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="827" height="563" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-264.png" alt="" class="wp-image-1236467" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-264.png 827w, https://blog.finxter.com/wp-content/uploads/2023/03/image-264-300x204.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-264-768x523.png 768w" sizes="auto, (max-width: 827px) 100vw, 827px" /></figure>
</div>


<p>I struggled to hit my daily target early in the challenge. This may explain why I didn’t experience more rapid weight loss at the start. Luckily most days were below the break-even point of 2218 calories so I still lost weight.</p>



<p>After day 50, I hit the target most days. This shows I got better at eating fewer calories over time. Overall, the total calories were not consistent at all, but they didn’t need to be. What seems to matter is the long-run average.</p>



<p>In hindsight, 2000 calories is still a good target even though I can’t expect to hit it every day. By setting a mildly ambitious target, I set up a situation where I can fail a little bit and still be Ok. </p>



<h2 class="wp-block-heading">Correlations</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="926" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-279.png" alt="" class="wp-image-1236543" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-279.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-279-294x300.png 294w, https://blog.finxter.com/wp-content/uploads/2023/03/image-279-768x784.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>We know that days correlate very highly with weight, but what about calories? What other interesting correlations might we find?</p>



<p>We can use a correlation heat map to find out. For calories, I added some calculated fields to make it interesting.</p>



<ul class="wp-block-list">
<li><code>yesterday_total_calories</code> &#8211; total calories offset one day in the past</li>



<li><code>total_calories_7dma</code> &#8211; average calories for the previous 7 days</li>



<li><code>total_calories_14dma</code> &#8211; average calories for the previous 14 days</li>



<li><code>total_calories_21dma</code> &#8211; average calories for the previous 21 days</li>
</ul>



<p>The reason for adding the moving averages is to smooth out the day-to-day variation. </p>
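<p>Those calculated fields can be built with <code>shift()</code> and <code>rolling()</code>. Here is a sketch on a toy calorie series (the values are invented):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

# Toy daily-calorie series, purely for illustration
df = pd.DataFrame({'total_calories': [2100, 1900, 2300, 2000, 1800, 2200, 1950, 2050]})

# Offset one day into the past
df['yesterday_total_calories'] = df.total_calories.shift(1)

# Average of the previous 7 days (shift first so today is excluded)
df['total_calories_7dma'] = df.total_calories.shift(1).rolling(7).mean()

print(df.tail(1))
</pre>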



<p>Here is the code to create the <a href="https://blog.finxter.com/how-to-make-heatmap-using-pandas-dataframe/" data-type="post" data-id="61559" target="_blank" rel="noreferrer noopener">heatmap</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np  # needed for the triangle mask below

def correlation_heatmap(df, title):
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Draw the heatmap with the mask
    sns.heatmap(corr, mask=mask, cmap='BuPu', center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5},
                annot=True)
    plt.title(title)
    plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="806" height="734" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-265.png" alt="" class="wp-image-1236472" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-265.png 806w, https://blog.finxter.com/wp-content/uploads/2023/03/image-265-300x273.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-265-768x699.png 768w" sizes="auto, (max-width: 806px) 100vw, 806px" /></figure>
</div>


<p>As expected, the longer the time frame for the moving average, the higher the correlation between past calorie consumption and the current day&#8217;s weight. </p>



<p>Does this mean that what I ate 14 days ago affects my weight today? </p>



<p>I don’t think so. I think a lot of things, such as hydration levels, can affect your weight at any given point in time. But that averages out, in the long run, leaving total calories as the dominating factor determining body weight.</p>



<h2 class="wp-block-heading">Good Days, Bad Days</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-280.png" alt="" class="wp-image-1236544" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-280.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-280-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-280-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>You know I’ve had my share. What did I eat on bad days vs good days? </p>



<p>I defined a bad day as any day I had a caloric surplus > 100 calories. It turns out I had 20 bad days, that’s 15% of the days in the challenge. That’s a lot more than I remember. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="779" height="286" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-266.png" alt="" class="wp-image-1236473" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-266.png 779w, https://blog.finxter.com/wp-content/uploads/2023/03/image-266-300x110.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-266-768x282.png 768w" sizes="auto, (max-width: 779px) 100vw, 779px" /></figure>
</div>


<p>Damn you sourdough, damn you straight to hell! Why do you have to taste so good? I don’t miss the other foods I’ve given up, like frozen pizza, chips, cookies, soda, and ice cream. But do I have to give up that fluffy slice of heaven known as sourdough bread? Apparently so. They say you can lose weight without giving up the foods you love. They lie. As the immortal Jack LaLanne once said, “If it tastes good, spit it out!”</p>



<p>Why would decaf coffee show up on this list? I used a high-calorie creamer and drank extra cups on bad days. And it’s also something I drank pretty much every day.</p>



<p>For comparison, let’s look at the top calorie sources on good days, defined as any day with a calorie deficit > 100 calories.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="704" height="286" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-267.png" alt="" class="wp-image-1236474" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-267.png 704w, https://blog.finxter.com/wp-content/uploads/2023/03/image-267-300x122.png 300w" sizes="auto, (max-width: 704px) 100vw, 704px" /></figure>
</div>


<p>Boiled potatoes, quinoa, tofu, bananas, and sardines. Doesn’t sound very appetizing, does it? </p>



<p>Apparently that’s why they work as weight-loss foods. Oh well, at least I have beer. It is a matter of pride that I could have one beer a day and still lose weight. I really looked forward to that beer every day. The sardines, not so much.</p>



<p>Why does tofu work well as a weight-loss food? It’s high in protein, and it sits in your stomach like a brick. And it won’t stimulate your appetite. Boiled potatoes are similarly filling due to the high water content. Most people think potatoes are a fattening food, but I think it’s all in how they are prepared. If you fry them in oil and smother them in salt, then absolutely they become junk food: dense in calories and overstimulating to the appetite.</p>



<h2 class="wp-block-heading">Really, Really good days</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-281.png" alt="" class="wp-image-1236546" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-281.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-281-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-281-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>There were 4 days where I was able to eat less than 1400 calories total. What did those days look like?&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="691" height="286" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-268.png" alt="" class="wp-image-1236475" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-268.png 691w, https://blog.finxter.com/wp-content/uploads/2023/03/image-268-300x124.png 300w" sizes="auto, (max-width: 691px) 100vw, 691px" /></figure>
</div>


<p>Honey water???</p>



<p>Basically, that’s just herbal tea sweetened with honey. Apparently, I drank a lot of it on those days. It makes sense to fill up on liquids when trying to lose weight.</p>



<p>And I think sipping on herbal tea also distracted me from the fact that I wasn’t eating as much. Consider that a 12-ounce can of Coke has 39 grams of sugar, whereas a teaspoon of honey has only 5.6 grams. 39 grams of sugar is about 10 teaspoons’ worth!</p>



<p>I couldn’t imagine adding 10 teaspoons of sugar to a mug of tea, or to any drink for that matter. I couldn’t even imagine adding 10 teaspoons of sugar to a bowl of Cheerios. What happens when someone gets used to that much sugar? Healthy foods won’t taste sweet enough anymore.</p>



<h2 class="wp-block-heading">Which foods are most nutritious?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="625" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-282.png" alt="" class="wp-image-1236550" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-282.png 625w, https://blog.finxter.com/wp-content/uploads/2023/03/image-282-200x300.png 200w" sizes="auto, (max-width: 625px) 100vw, 625px" /></figure>
</div>


<p>I created a nutrient score for each food item in my diary by adding up the percentages of the US RDA it provided for vitamins and minerals, then dividing by its number of calories.</p>



<p>The units for each food item are just how much I ate that day. So I’m looking at which foods contributed the most toward meeting my nutrient needs for the fewest calories.</p>
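<p>As a rough sketch of that score in pandas (the nutrient columns and numbers below are hypothetical, not my actual diary data):</p>

```python
import pandas as pd

# Each micronutrient column holds the percent of the US RDA the item provided.
diary = pd.DataFrame({
    'food':          ['spinach', 'sardines', 'boiled potatoes'],
    'calories':      [23, 208, 87],
    'vitamin_a_pct': [188, 2, 0],
    'vitamin_c_pct': [47, 0, 22],
    'iron_pct':      [15, 16, 4],
})

# Sum the percent-RDA columns and divide by calories:
# nutrients delivered per calorie spent.
pct_cols = [c for c in diary.columns if c.endswith('_pct')]
diary['nutrient_score'] = diary[pct_cols].sum(axis=1) / diary['calories']
print(diary[['food', 'nutrient_score']])
```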


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="493" height="345" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-269.png" alt="" class="wp-image-1236476" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-269.png 493w, https://blog.finxter.com/wp-content/uploads/2023/03/image-269-300x210.png 300w" sizes="auto, (max-width: 493px) 100vw, 493px" /></figure>
</div>


<p>Greens for the win! Adding a variety of leafy greens each day is a really good idea. And spinach tastes pretty good as long as it’s fresh, especially baby spinach. Cilantro also adds an interesting flavor.</p>



<h2 class="wp-block-heading">What about sodium?</h2>



<p>Sodium is one nutrient you don’t want to get too much of. Unfortunately, the sodium content in processed foods is very high. There were 30 days where I got more than 150% of the US RDA (Recommended Daily Allowance) of sodium, and 10 days where I got more than 200%!</p>
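<p>Counting threshold days like that is a one-liner once daily sodium totals are in a DataFrame. The numbers below are hypothetical stand-ins:</p>

```python
import pandas as pd

# Hypothetical daily sodium intake, as a percent of the US RDA.
daily = pd.DataFrame({'sodium_pct': [120, 160, 210, 95, 175, 230, 140]})

# Boolean comparisons produce True/False columns; summing counts the Trues.
over_150 = (daily['sodium_pct'] > 150).sum()
over_200 = (daily['sodium_pct'] > 200).sum()
print(f'days over 150%: {over_150}, days over 200%: {over_200}')
```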



<p>What foods did I eat that were highest in sodium?</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="553" height="386" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-270.png" alt="" class="wp-image-1236478" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-270.png 553w, https://blog.finxter.com/wp-content/uploads/2023/03/image-270-300x209.png 300w" sizes="auto, (max-width: 553px) 100vw, 553px" /></figure>
</div>


<p>You&#8217;re killin’ me, Trader Joe!</p>



<p>Basically, all of these are convenience foods that taste pretty good. The cost is too much sodium and too many calories. This brings to mind another Jack LaLanne quote: <em>&#8220;If man makes it, don&#8217;t eat it.&#8221;</em> The good news is that if I don’t eat these foods, I can afford to add some salt to my dinner.</p>



<p>A bit of salt does wonders for the taste of foods like quinoa.</p>



<h2 class="wp-block-heading">Conclusions</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="663" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-283.png" alt="" class="wp-image-1236553" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-283.png 663w, https://blog.finxter.com/wp-content/uploads/2023/03/image-283-212x300.png 212w" sizes="auto, (max-width: 663px) 100vw, 663px" /></figure>
</div>


<p>I was able to create an extremely accurate real-world regression model with only one feature.</p>



<p>All I need is the starting date and the number of days into the weight-loss regimen, and I can predict how much weight I lost with a high degree of accuracy. An R-squared of 0.98 is pretty darn good! The only caveat is that the model is only accurate after about 3 weeks.</p>
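<p>A one-feature linear fit like that can be sketched with NumPy. This isn’t my actual code, and the weigh-in numbers below are synthetic, but the mechanics are the same:</p>

```python
import numpy as np

# Synthetic daily weigh-ins: a steady downward trend plus noise.
rng = np.random.default_rng(42)
days = np.arange(90)
weight = 200 - 0.25 * days + rng.normal(0, 0.5, size=days.size)

# Fit weight as a linear function of days elapsed.
slope, intercept = np.polyfit(days, weight, deg=1)
predicted = slope * days + intercept

# R-squared: the fraction of variance the fit explains.
ss_res = ((weight - predicted) ** 2).sum()
ss_tot = ((weight - weight.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(f'slope: {slope:.3f} lb/day, R^2: {r_squared:.3f}')
```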



<p>I also learned a lot from analyzing the data after the fact. I was surprised at the number of times I actually failed to meet my daily targets. Yet the encouraging thing is that it doesn’t matter! As long as I succeed more often than I fail, and my successes are greater than my failures, the plan will work. There is no need to try to hit an exact calorie amount each and every day.</p>



<p>I also learned a good bit about foods that work for me versus the ones that don’t. The key is to process your own food. If you allow Coca-Cola and Nabisco to do it for you, they will pack in the calories and make the food over-palatable, encouraging you to overeat. It comes down to learning to appreciate the subtle taste of healthy food versus the overwhelming taste of junk food. What makes food taste better? Salt, sugar, and fat. You want to be the one controlling the amounts. If you know how to cook, there is also texture, presentation, herbs and spices, etc. Guess I need to learn to cook!</p>



<p>As a final note, it’s fascinating how well the conclusions I’ve drawn from the data match ancient wisdom. Here’s an example from way back in the mid-1900s:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Jack LaLanne - Talks about the best meal plan...." width="937" height="703" src="https://www.youtube.com/embed/JDq-9K9XmSQ?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/">How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Create a DataFrame From Lists?</title>
		<link>https://blog.finxter.com/how-to-create-a-dataframe-from-lists/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Sat, 17 Dec 2022 08:39:56 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Structures]]></category>
		<category><![CDATA[Pandas Library]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python List]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=985131</guid>

					<description><![CDATA[<p>Pandas is a great library for data analysis in Python. With Pandas, you can create visualizations, filter rows or columns, add new columns, and save the data in a wide range of formats. The workhorse of Pandas is the DataFrame. 👉 Recommended: 10 Minutes to Pandas (in 5 Minutes) So the first step working with ... <a title="How to Create a DataFrame From Lists?" class="read-more" href="https://blog.finxter.com/how-to-create-a-dataframe-from-lists/" aria-label="Read more about How to Create a DataFrame From Lists?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-create-a-dataframe-from-lists/">How to Create a DataFrame From Lists?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Pandas is a great library for data analysis in Python. With Pandas, you can create visualizations, filter rows or columns, add new columns, and save the data in a wide range of formats. The workhorse of Pandas is the <strong>DataFrame</strong>. </p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">10 Minutes to Pandas (in 5 Minutes)</a></p>



<p>So the first step working with Pandas is often to get our data into a DataFrame. If we have data stored in <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">lists</a>, how can we create this all-powerful DataFrame? </p>



<p>There are 4 basic strategies:</p>



<ol class="wp-block-list" type="1">
<li>Create a <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> with column names as keys and your lists as values. Pass this dictionary as an argument when creating the DataFrame.</li>



<li>Pass your lists into the <code><a href="https://blog.finxter.com/python-ziiiiiiip-a-helpful-guide/" data-type="post" data-id="1938" target="_blank" rel="noreferrer noopener">zip()</a></code> function. As with strategy 1, your lists will become columns in the DataFrame.</li>



<li>Put your lists into a list instead of a dictionary. In this case, your lists become rows instead of columns.</li>



<li><a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" data-type="post" data-id="16764" target="_blank" rel="noreferrer noopener">Create an empty DataFrame</a> and add columns one by one.</li>
</ol>



<h2 class="wp-block-heading">Method 1: Create a DataFrame using a Dictionary</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="1010" height="645" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-237.png" alt="" class="wp-image-985155" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-237.png 1010w, https://blog.finxter.com/wp-content/uploads/2022/12/image-237-300x192.png 300w, https://blog.finxter.com/wp-content/uploads/2022/12/image-237-768x490.png 768w" sizes="auto, (max-width: 1010px) 100vw, 1010px" /></figure>
</div>


<p>The first step is to import pandas. If you haven&#8217;t already, <a href="https://blog.finxter.com/how-to-install-pandas-in-python/" data-type="post" data-id="35926" target="_blank" rel="noreferrer noopener">install pandas</a> first.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd</pre>



<p>Let&#8217;s say you have employee data stored as lists.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># if your data is stored like this
employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]
bonus = [1000, 500, 2500, 400]
tax_rate = [.1, .25, .17, .4]
absences = [0, 1, 0, 52]
</pre>



<p>Build a dictionary using column names as keys and your lists as values.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># you can easily create a dictionary that will define your dataframe
emp_data = {
    'name': employee,
    'salary': salary,
    'bonus': bonus,
    'tax_rate': tax_rate,
    'absences': absences
}

# pass the dictionary to the DataFrame constructor
emp_df = pd.DataFrame(emp_data)
emp_df
</pre>



<p>Your lists will become columns in the resulting DataFrame.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="367" height="164" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-230.png" alt="" class="wp-image-985144" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-230.png 367w, https://blog.finxter.com/wp-content/uploads/2022/12/image-230-300x134.png 300w" sizes="auto, (max-width: 367px) 100vw, 367px" /></figure>
</div>


<h2 class="wp-block-heading">Create a DataFrame using the zip function</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="1010" height="668" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-238.png" alt="" class="wp-image-985156" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-238.png 1010w, https://blog.finxter.com/wp-content/uploads/2022/12/image-238-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2022/12/image-238-768x508.png 768w" sizes="auto, (max-width: 1010px) 100vw, 1010px" /></figure>
</div>


<p>Pass each list as a separate argument to the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-ziiiiiiip-a-helpful-guide/" data-type="post" data-id="1938" target="_blank">zip()</a></code> function. You can specify the column names using the <code>columns</code> parameter or by setting the <code>columns</code> property on a separate line.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df = pd.DataFrame(zip(employee, salary, bonus, tax_rate, absences))
emp_df.columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']
</pre>
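<p>For example, passing the <code>columns</code> parameter sets the names in the same call (two of the lists are re-declared here so the snippet stands alone):</p>

```python
import pandas as pd

employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]

# zip pairs up the lists row by row; columns= names the columns in one step.
emp_df = pd.DataFrame(zip(employee, salary), columns=['name', 'salary'])
print(emp_df)
```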



<p>The <code>zip()</code> function creates an <a href="https://blog.finxter.com/iterators-iterables-and-itertools/" data-type="post" data-id="29507" target="_blank" rel="noreferrer noopener">iterator</a>. For the first iteration, it grabs every value at index 0 from each list. This becomes the first row in the DataFrame. Next, it grabs every value at index 1 and this becomes the second row. This continues until it exhausts the shortest list.</p>



<p>We can loop through the iterator to see how this works.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">i = 0
for value in zip(employee, salary, bonus, tax_rate, absences):
  print(f'zipped value at index {i}: {value}')
  i += 1
</pre>



<p>Each of these values becomes a row in the DataFrame:</p>



<pre class="wp-block-preformatted"><code>zipped value at index 0: ('Betty', 110000, 1000, 0.1, 0)
zipped value at index 1: ('Veronica', 20000, 500, 0.25, 1)
zipped value at index 2: ('Archie', 80000, 2500, 0.17, 0)
zipped value at index 3: ('Jughead', 70000, 400, 0.4, 52)</code>
</pre>



<h2 class="wp-block-heading">Create a DataFrame using a list of lists</h2>



<p>What if you have a separate list for each employee? In this case, we can just create a <a href="https://blog.finxter.com/python-list-of-lists/" data-type="post" data-id="7890" target="_blank" rel="noreferrer noopener">list of lists</a>. Each of the inner lists becomes a row in the DataFrame.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># lists for employees instead of features
betty = ['Betty', 110000, 1000, 0.1, 0]
veronica = ['Veronica', 20000, 500, 0.25, 1]
archie = ['Archie', 80000, 2500, 0.17, 0]
jughead = ['Jughead', 70000, 400, 0.4, 52]

emp_df = pd.DataFrame([betty, veronica, archie, jughead])
emp_df.columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']
emp_df
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="380" height="158" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-231.png" alt="" class="wp-image-985145" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-231.png 380w, https://blog.finxter.com/wp-content/uploads/2022/12/image-231-300x125.png 300w" sizes="auto, (max-width: 380px) 100vw, 380px" /></figure>
</div>


<h2 class="wp-block-heading">Create a DataFrame using a list of dictionaries</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="856" height="863" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-239.png" alt="" class="wp-image-985157" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-239.png 856w, https://blog.finxter.com/wp-content/uploads/2022/12/image-239-298x300.png 298w, https://blog.finxter.com/wp-content/uploads/2022/12/image-239-150x150.png 150w, https://blog.finxter.com/wp-content/uploads/2022/12/image-239-768x774.png 768w" sizes="auto, (max-width: 856px) 100vw, 856px" /></figure>
</div>


<p>If the employee data is stored in dictionaries instead of lists, we use a list of dictionaries.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">betty = {'name': 'Betty', 'salary': 110000, 'bonus': 1000, 
         'tax_rate': 0.1, 'absences': 0}

veronica = {'name': 'Veronica', 'salary': 20000, 'bonus': 500, 
            'tax_rate': 0.25, 'absences': 1}

archie = {'name': 'Archie', 'salary': 80000, 'bonus': 2500, 
          'tax_rate': 0.17, 'absences': 0}
          
jughead = {'name': 'Jughead', 'salary': 70000, 'bonus': 400, 
           'tax_rate': 0.4, 'absences': 52}

pd.DataFrame([betty, veronica, archie, jughead])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="374" height="159" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-232.png" alt="" class="wp-image-985146" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-232.png 374w, https://blog.finxter.com/wp-content/uploads/2022/12/image-232-300x128.png 300w" sizes="auto, (max-width: 374px) 100vw, 374px" /></figure>
</div>


<p>The columns are determined by the keys in the dictionaries. What if the dictionaries don’t all have the same keys?</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">betty = {'name': 'Betty', 'salary': 110000, 'bonus': 1000, 
         'tax_rate': 0.1, 'absences': 0, 'hire_date': '2001-01-01'}

veronica = {'name': 'Veronica', 'salary': 20000, 'bonus': 500, 
            'tax_rate': 0.25, 'absences': 1}

archie = {'name': 'Archie', 'salary': 80000, 'bonus': 2500, 
          'tax_rate': 0.17, 'absences': 0, 'title': 'Vice Chief Leader'}
          
jughead = {'name': 'Jughead', 'salary': 70000, 'bonus': 400,      
           'tax_rate': 0.4, 'absences': 52, 'rank': 'yes'}

pd.DataFrame([betty, veronica, archie, jughead])
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="151" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-233.png" alt="" class="wp-image-985147" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-233.png 624w, https://blog.finxter.com/wp-content/uploads/2022/12/image-233-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>All of the keys will be used. Anytime pandas encounters a dictionary with a missing key, the missing value is replaced with <code>NaN</code>, which stands for “Not a Number”.</p>



<h2 class="wp-block-heading">Create an empty DataFrame and add columns one by one</h2>



<p>This method might be preferable if you need to create a lot of new calculated columns. Here we create a new column for after-tax income.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df = pd.DataFrame()
emp_df['name'] = employee
emp_df['salary'] = salary
emp_df['bonus'] = bonus
emp_df['tax_rate'] = tax_rate
emp_df['absences'] = absences

income = emp_df['salary'] + emp_df['bonus']
emp_df['after_tax'] = income * (1 - emp_df['tax_rate'])
</pre>



<h2 class="wp-block-heading">How to add a list to an existing DataFrame</h2>



<p>Here is a neat trick. If you want to edit a row in a DataFrame, you can use the handy <code><a href="https://blog.finxter.com/slicing-data-from-a-pandas-dataframe-using-loc-and-iloc/" data-type="post" data-id="230997" target="_blank" rel="noreferrer noopener">loc</a></code> indexer. <code>loc</code> allows you to access rows and columns by their index values.</p>



<p>To access a row:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df.loc[3]</pre>



<p>Output is the row with index value 3 as a Series:</p>



<pre class="wp-block-preformatted"><code>name        Jughead
salary        70000
bonus           400
tax_rate        0.4
absences         52
Name: 3, dtype: object</code>
</pre>



<p>To access a column, just pass in the column name as the column index. Note that we have to specify both the row and column indexes. The format is <code>[rows, columns]</code>. If you want all rows, you can use “<code>:</code>” as we do here. The <code>:</code> also works if you want all columns.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df.loc[:, 'salary']</pre>



<p>The output is also a Series:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0    110000
1     20000
2     80000
3     70000
Name: salary, dtype: int64
</pre>



<p>So how do we use <code>loc</code> to add a new row? If we use a row index that doesn’t exist in the DataFrame, it will create a new row for us.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">new_emp = ['Fonzie', 200000, 30000, .05, 112]
emp_df.loc[4] = new_emp
emp_df
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="366" height="183" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-234.png" alt="" class="wp-image-985148" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-234.png 366w, https://blog.finxter.com/wp-content/uploads/2022/12/image-234-300x150.png 300w" sizes="auto, (max-width: 366px) 100vw, 366px" /></figure>
</div>


<p>You can also update existing data with <code>loc</code>. Let’s lower Fonzie’s salary. It looks a bit excessive.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df.loc[4, 'salary'] = 105000
emp_df
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="376" height="183" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-235.png" alt="" class="wp-image-985149" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-235.png 376w, https://blog.finxter.com/wp-content/uploads/2022/12/image-235-300x146.png 300w" sizes="auto, (max-width: 376px) 100vw, 376px" /></figure>
</div>


<p>That’s more like it.</p>



<h2 class="wp-block-heading"><strong>Conclusion</strong></h2>



<p>There are many different ways of creating a DataFrame. We looked at several methods using data stored in lists. Each will get the job done. </p>



<p>The most convenient method will depend on what your lists represent. </p>



<p>If each of your lists would best be represented as a column, then a dictionary of lists might be the easiest way to go. </p>



<p>If each of your lists would best be represented as a row, then a list of lists would be a good choice. </p>



<p>To add data in a list as a new row to an existing DataFrame, the <code>loc</code> indexer comes in handy. It is also useful for updating existing data.</p>
<p>The post <a href="https://blog.finxter.com/how-to-create-a-dataframe-from-lists/">How to Create a DataFrame From Lists?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
