<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Charles Blue, Author at Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/author/charlesblue/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/author/charlesblue/</link>
	<description></description>
	<lastBuildDate>Thu, 19 Oct 2023 08:56:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Charles Blue, Author at Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/author/charlesblue/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</title>
		<link>https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Thu, 19 Oct 2023 08:56:58 +0000</pubDate>
				<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python Requests]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652308</guid>

					<description><![CDATA[<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from MindBodyOnline.com or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot! 🕷️ Web scraping, a technique used to extract data from ... <a title="How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com" class="read-more" href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/" aria-label="Read more about How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This article is based on a freelance job posted on Upwork to scrape data for all the gyms in the USA from <a href="http://MindBodyOnline.com">MindBodyOnline.com</a> or another similar site. I treated this as a learning project, and it was a good one, as I learned a lot!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="598" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png" alt="" class="wp-image-1652328" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1024x598.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-300x175.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-768x449.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156-1536x897.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-156.png 1585w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f577.png" alt="🕷" class="wp-smiley" style="height: 1em; max-height: 1em;" /></strong> <strong>Web scraping</strong>, a technique used to extract data from websites, has become an essential skill on Upwork &#8212; it&#8217;s one of the most sought-after skills on most <a href="https://blog.finxter.com/best-python-freelancer-platforms/">freelancing platforms</a>. Most beginners start with the <strong><a href="https://blog.finxter.com/installing-beautiful-soup/">Beautiful Soup</a></strong> and <strong><a href="https://blog.finxter.com/python-requests-library-2/">Requests</a></strong> modules in Python. These tools cover many sites, but not all of them. Tools like <strong><a href="https://blog.finxter.com/how-to-open-a-url-in-python-selenium/">Selenium</a></strong> can fill the gap, but they are sometimes overkill or inefficient. </p>



<p>So, where should one start? The answer is simple: Always check for an API first.</p>



<h3 class="wp-block-heading">Why Start with APIs?</h3>



<p>An <strong>Application Programming Interface (API)</strong> allows two software applications to communicate with each other. Many websites offer APIs to provide structured access to their data, making it easier and more efficient than scraping the web pages directly.</p>



<p>Benefits of using APIs:</p>



<ul class="wp-block-list">
<li><strong>Efficiency</strong>: Extracting data from APIs is often faster and less resource-intensive than scraping web pages.</li>



<li><strong>Reliability</strong>: APIs are designed to be accessed programmatically, reducing the chances of breaking changes.</li>



<li><strong>Ethical considerations</strong>: Accessing data via an API is often more in line with a website&#8217;s terms of service than scraping their pages directly.</li>
</ul>



<p>MindBodyOnline provides a dedicated API tailored for developers: <a href="https://developers.mindbodyonline.com/ui/documentation/public-api#/http/mindbody-public-api-v6-0/introduction/getting-started">MindBody API</a>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="536" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png" alt="" class="wp-image-1652310" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-140-1024x536.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-300x157.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140-768x402.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-140.png 1426w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>If you&#8217;re aiming to craft an app utilizing their dataset, this API is your ideal resource. It boasts a plethora of endpoints, enabling swift data retrieval and ensuring seamless interaction between your application and their servers.</p>



<p><strong>But what if you aren’t creating an application and just need to scrape data once for research?</strong> MindBodyOnline also retrieves data for its own website via an API. JavaScript is used to request the data that populates the site, and we can make requests to that same API ourselves.</p>



<h2 class="wp-block-heading">How to Check if a Website Is Rendered with JavaScript</h2>



<p>The site we will be scraping is <a href="https://www.mindbodyonline.com/explore">MindBodyOnline</a>. </p>



<p>If a website is rendered with <a href="https://blog.finxter.com/javascript-data-types/">JavaScript</a>, we should check the network traffic and see if we can find a request that returns the data we see on the page. This can be done quickly with developer tools. In Chrome, you can bring up developer tools by pressing <code>Ctrl-Shift-I</code>. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png" alt="" class="wp-image-1652313" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-768x531.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143-1536x1063.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/10/image-143.png 1620w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>From here, we can turn off JavaScript, then refresh the page and see if anything changes. To turn off JavaScript, first press <code>Ctrl-Shift-P</code> to bring up the command palette. Start typing “javascript” to filter the options, then click “Disable JavaScript”.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="506" height="117" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png" alt="" class="wp-image-1652311" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-141.png 506w, https://blog.finxter.com/wp-content/uploads/2023/10/image-141-300x69.png 300w" sizes="auto, (max-width: 506px) 100vw, 506px" /></figure>
</div>


<p>Then refresh the page. As we can see, they use JavaScript for all the data.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="444" height="95" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png" alt="" class="wp-image-1652312" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-142.png 444w, https://blog.finxter.com/wp-content/uploads/2023/10/image-142-300x64.png 300w" sizes="auto, (max-width: 444px) 100vw, 444px" /></figure>
</div>


<p>Before we can continue, we need to turn JavaScript back on. Bring up the command palette again, filter for “javascript”, and click “Enable JavaScript”. Then refresh the page again.</p>
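
<p>This manual check can also be approximated in code. The sketch below tests whether a string you can see in the browser is present in the raw HTML; if it isn&#8217;t, the page is most likely rendered client-side. The network call is commented out so the sketch stands alone, and the sample strings and gym name are hypothetical.</p>

```python
def appears_js_rendered(html, expected_text):
    # If text that is visible in the browser is missing from the raw HTML,
    # the data is almost certainly filled in by JavaScript.
    return expected_text.lower() not in html.lower()

# A real check would fetch the page first, e.g.:
# import requests
# html = requests.get("https://www.mindbodyonline.com/explore").text
# print(appears_js_rendered(html, "a gym name you can see in the browser"))

# Hypothetical samples for illustration:
server_page = "full page body that already contains the Example Gym listing"
client_page = "empty app shell; listings are filled in by JavaScript later"
```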



<h2 class="wp-block-heading">Check the JavaScript Requests</h2>



<p>Select the Network tab in developer tools.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png" alt="" class="wp-image-1652314" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-144.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-144-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>Make sure <code>Fetch/XHR</code> and <code>Preserve log</code> are selected. Next, we can click the circle with the line through it to clear the output. Then perform a search to see what requests were performed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="192" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png" alt="" class="wp-image-1652315" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-145.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-145-300x92.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can then check each item in the output to see if it returns useful information.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="601" height="255" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png" alt="" class="wp-image-1652316" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-146.png 601w, https://blog.finxter.com/wp-content/uploads/2023/10/image-146-300x127.png 300w" sizes="auto, (max-width: 601px) 100vw, 601px" /></figure>
</div>


<p>We are primarily interested in the response to each request. We are looking for JSON data that matches the data shown on the page. In this case, it is the <code>locations</code> request that contains the data we seek.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="395" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png" alt="" class="wp-image-1652317" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-147.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-147-300x190.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can also see that there is a payload required. When we make our requests, we must provide this payload in the request body. There are three items of interest here. The latitude and longitude allow us to control the city we are pulling data for, and we also need to provide a page number.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="562" height="111" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png" alt="" class="wp-image-1652318" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-148.png 562w, https://blog.finxter.com/wp-content/uploads/2023/10/image-148-300x59.png 300w" sizes="auto, (max-width: 562px) 100vw, 562px" /></figure>
</div>


<p>MindBody uses pagination, so a relatively small amount of data is pulled with each request. A large city like New York can have over a hundred pages.</p>
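
<p>Put together, the payload can be sketched as a Python dictionary. The field names below are the ones visible in the captured request; the coordinates are hypothetical values for New York City:</p>

```python
import json

# Hypothetical coordinates for New York City; page size 50 matches
# the site's own requests.
payload = {
    "sort": "-_score,distance",
    "page": {"size": 50, "number": 1},
    "filter": {
        "categories": "any",
        "latitude": 40.7128,
        "longitude": -74.0060,
        "categoryTypes": "any",
    },
}
body = json.dumps(payload)  # serialized into the request body
```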



<p>We go to the headers tab to copy the request URL.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="120" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png" alt="" class="wp-image-1652319" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-149.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-149-300x58.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<h2 class="wp-block-heading">Using Insomnia to Generate Request Headers</h2>



<p>From here, we can use a tool to help us with the request syntax. </p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Insomnia</strong> is a powerful open-source API client tool for testing and debugging APIs. It provides a user-friendly interface to send requests to web services and view responses. With Insomnia, you can define various request types, from simple HTTP GET requests to complex JSON, GraphQL, or even multipart file uploads. You can download the Insomnia desktop app <a href="https://insomnia.rest/download">here</a>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="601" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png" alt="" class="wp-image-1652320" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-150-1024x601.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-300x176.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150-768x451.png 768w, https://blog.finxter.com/wp-content/uploads/2023/10/image-150.png 1342w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Using Insomnia is quite simple. Just paste in the API URL and click <code>Send</code>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="152" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png" alt="" class="wp-image-1652321" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-151.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-151-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>We can check the preview tab to make sure it returns the data we want:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="493" height="506" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png" alt="" class="wp-image-1652322" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-152.png 493w, https://blog.finxter.com/wp-content/uploads/2023/10/image-152-292x300.png 292w" sizes="auto, (max-width: 493px) 100vw, 493px" /></figure>
</div>


<p>This is where it gets good. If we click the dropdown on the Send button, one of the options is “Generate client code”. How convenient! Just select Python as the language and Requests as the library, click “Copy to Clipboard”, and you’re off to the races.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="397" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png" alt="" class="wp-image-1652323" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-153.png 624w, https://blog.finxter.com/wp-content/uploads/2023/10/image-153-300x191.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>
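
<p>The generated snippet looks roughly like the following. Header values are abridged here (the full captured set appears in the spider code below), and the network call is commented out so the sketch stands alone:</p>

```python
# The endpoint URL is the one copied from the Headers tab; header values
# are abridged (the full set appears in the spider code below).
url = "https://prod-mkt-gateway.mindbody.io/v1/search/locations"
headers = {
    "accept": "application/vnd.api+json",
    "content-type": "application/json",
    "origin": "https://www.mindbodyonline.com",
    # ...user-agent, cookie, and x-mb-* headers as captured...
}

# The actual call (commented out so this sketch makes no network request):
# import requests
# response = requests.request("GET", url, data=payload_string, headers=headers)
# print(response.text)
```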


<h2 class="wp-block-heading">A Simple Scrapy Spider</h2>



<p>The code can be found on <a href="https://github.com/PythonCB/Scrape_MindBodyOnline">Github</a>. I will walk through the code below, starting with the imports.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import scrapy
import json
import pandas as pd
from scrapy.crawler import CrawlerProcess
import os
</pre>



<p><a href="https://blog.finxter.com/python-scrapy-scraping-dynamic-website-with-api-generated-content/">Scrapy</a> is a good option because it can handle multiple requests at the same time with <a href="https://blog.finxter.com/python-async-for-mastering-asynchronous-iteration-in-python/">asynchronous</a> processing. Scrapy has a lot of bells and whistles and a fair bit of a learning curve, but it’s also possible to avoid much of the extra complexity. The goal here was to place all the code in one simple script.</p>



<p>First, we have to create a spider class. The class is pretty large so I’ll display it in chunks.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">class MindbodySpider(scrapy.Spider):
    name = 'mindbody_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 5,
        'DOWNLOAD_DELAY': 3.2,
    }
</pre>



<p>Our class inherits from one of the Scrapy <code>Spider</code> classes, with <code>scrapy.Spider</code> being the simplest. In the custom settings, with <code>CONCURRENT_REQUESTS</code> set to <code>5</code>, Scrapy will process up to five requests at a time, starting a new one as soon as one finishes. </p>



<p>We use a <code>DOWNLOAD_DELAY</code> so we don’t bombard the website with too many requests at once.</p>



<p>Next, we need a starting template for the payload:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">starting_payload = '''{
                          "sort":"-_score,distance",
                          "page":{"size":50,"number":&lt;&lt;pg>>},
                          "filter":{"categories":"any",
                                    "latitude":&lt;&lt;lat>>,
                                    "longitude":&lt;&lt;lon>>,
                                    "categoryTypes":"any"}
                       }'''
</pre>



<p>Next, we have the headers that Insomnia so helpfully provided for us.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">headers = {
        "cookie": "__cf_bm=zdIhLHXKd2OAveBChKORUMdydUFVzC2Ma51sQxv.UJ0-1694646164-0-Abmbwcj2wNw%2FpityY4DWRWy%2FftBkjTO0vQ3tZ0gwU0P5bsTqcasf2XZlBwL%2BUaevGaH%2BTDzZOJPBXbWYwgsXkJc%3D",
        "authority": "prod-mkt-gateway.mindbody.io",
        "accept": "application/vnd.api+json",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "origin": "https://www.mindbodyonline.com",
        "sec-ch-ua": "^\^Not/A",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "^\^Windows^^",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "cross-site",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "x-mb-app-build": "2023-08-02T13:33:44.200Z",
        "x-mb-app-name": "mindbody.io",
        "x-mb-app-version": "e5d1fad6",
        "x-mb-user-session-id": "oeu1688920580338r0.2065068094427127"
    }
</pre>



<p>Then a very simple <code>__init__</code> method:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def __init__(self):
        scrapy.Spider.__init__(self)
        self.city_count = 0
</pre>



<p>The <code>start_requests</code> method loops through each city. This is the main loop that creates the first request for each city.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def start_requests(self):
        cities = pd.read_csv('uscities.csv')

        for idx, city in cities.iterrows():
            lat, lon = city.lat, city.lng
            self.logger.info(f"{city.city}, {city.state_id} started")

            # Start with the first page for each city
            payload = self.starting_payload.replace('&lt;&lt;pg>>', '1').replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))

            yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': city.city, 'page_num': 1, 'lat': lat, 'lon': lon, 'state': city.state_id},
                callback=self.parse
            )
</pre>



<p>The code is pretty simple. We <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/">create a DataFrame</a> from a <a href="https://blog.finxter.com/read-a-csv-file-to-a-pandas-dataframe/">CSV file</a> with city information and then loop through it with the <code>iterrows</code> method. We create the payload for the request using the template and the lat/long values from the DataFrame. The page is set to 1 each time. We will handle additional pages later.</p>



<p>Finally, we yield a <code>scrapy.Request</code> object. We use <code><a href="https://blog.finxter.com/yield-keyword-in-python-a-simple-illustrated-guide/">yield</a></code> instead of <code><a href="https://blog.finxter.com/python-return/">return</a></code> so we can handle <a href="https://blog.finxter.com/python-async-requests-getting-urls-concurrently-via-https/">multiple requests concurrently</a>. The body is our modified payload, and we use the same header for each request.</p>
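
<p>The difference matters because <code>start_requests</code> is a generator: Scrapy pulls requests from it one at a time and schedules them, instead of waiting for a complete list. A tiny standalone illustration:</p>

```python
def make_items():
    # Each yield hands one item back to the consumer immediately;
    # a return would end the function after a single value.
    for page in range(1, 4):
        yield {"page": page}

items = list(make_items())  # Scrapy drains start_requests() the same way
```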



<p>What do we do with the response returned from the request? As soon as the response is returned it is fed into the parse method thanks to the callback parameter:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">callback=self.parse</pre>



<p>The <code>meta</code> parameter gives us a way to pass information to the <code>callback</code> function. We need the <code>page_num</code>, <code>lat</code>, and <code>lon</code> values for the next request. <code>city_name</code> and <code>state</code> are used for screen output.</p>



<p>The list of cities was downloaded from the web. Many different city lists will work, as long as they contain latitude and longitude values.</p>
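
<p>As a sketch, a minimal stand-in for <code>uscities.csv</code> only needs the columns the spider reads (<code>city</code>, <code>state_id</code>, <code>lat</code>, <code>lng</code>); the two rows below are illustrative:</p>

```python
import io
import pandas as pd

# Illustrative rows; any city list works if it has these columns.
csv_text = """city,state_id,lat,lng
New York,NY,40.7128,-74.0060
Los Angeles,CA,34.0522,-118.2437
"""
cities = pd.read_csv(io.StringIO(csv_text))
coords = [(row.lat, row.lng) for _, row in cities.iterrows()]
```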



<h2 class="wp-block-heading">Parsing the Response</h2>



<p>The <code>parse</code> method is a little long, but not too complicated. </p>



<p>Getting the data and saving it is very easy. We just convert <code>response.text</code> to a DataFrame and <a href="https://blog.finxter.com/how-to-export-pandas-dataframe-to-csv-example/">save it to a CSV file</a>. If the file already exists, we will append the data and not include a header. Otherwise, we create a new CSV file and include a header.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def parse(self, response):
        data = json.loads(response.text)
        gyms_df = pd.json_normalize(data['data'])

        # Save the dataframe to a CSV
        city_name = response.meta['city_name']
        state = response.meta['state']
        fname = f'{city_name} {state}.csv'.replace(' ', '_')
        csv_path = f'./data/cities2/{fname}'

        # Check if file exists to determine the write mode
        write_mode = 'a' if os.path.exists(csv_path) else 'w'

        gyms_df.to_csv(csv_path, 
                       mode=write_mode, 
                       index=False, 
                       header=(not os.path.exists(csv_path)))         
</pre>



<h2 class="wp-block-heading">Handling Pagination</h2>



<p>To move on to the next page, we need to create another Scrapy Request. For the payload we use the same latitude and longitude and increment the page number by 1.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">        # Check if there's another page and if so, initiate the request
        next_page_num = response.meta['page_num'] + 1
        if next_page_num &lt;= 150:  # Optional: upper limit
            lat, lon = response.meta['lat'], response.meta['lon']  # passed along in meta

            payload = self.starting_payload.replace('&lt;&lt;pg>>', str(next_page_num)).replace('&lt;&lt;lat>>', str(lat)).replace('&lt;&lt;lon>>', str(lon))
</pre>



<h2 class="wp-block-heading">Make the Request for the Next Page</h2>



<p>To finish the <code>parse</code> method, all we have to do is make another request with the new payload.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">yield scrapy.Request(
                url="https://prod-mkt-gateway.mindbody.io/v1/search/locations",
                method="GET",
                body=payload,
                headers=self.headers,
                meta={'city_name': response.meta['city_name'], 
                      'page_num': next_page_num, 
                      'lat': lat, 
                      'lon': lon,
                      'state': state},
                callback=self.parse
            )

        self.city_count += 1
        print(response.meta['city_name'], f'complete ({self.city_count})')
        self.logger.info(f"""{response.meta['city_name']}, 
                           {response.meta['state']} is complete""")
</pre>



<h2 class="wp-block-heading">How the Pagination Loop Terminates</h2>



<p>What happens if there are 100 pages for the current city and the code sends a request with <code>page_num = 101</code>? </p>



<p>The request will not return anything, so the callback function won’t get called and the recursive loop for that city will stop. </p>



<p>Then the <code>start_requests</code> loop will move on to the next city.</p>
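
<p>If you&#8217;d rather not rely on that behavior (or on the 150-page cap alone), an explicit check is easy to add. This sketch inspects the response body and only continues while the <code>data</code> list is non-empty:</p>

```python
import json

def has_more_results(response_text):
    # Keep paginating only while the API returns gym records.
    data = json.loads(response_text)
    return bool(data.get("data"))

# In parse(), yield the next request only if has_more_results(response.text)
```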



<h2 class="wp-block-heading">It’s alive! Setting Our Little Spider Loose</h2>



<p>To get our creepy critter crawling, we create a <code>CrawlerProcess</code>. Then tell it to crawl. Then tell it to start. On your mark, get set, CRAWL!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">process = CrawlerProcess()
process.crawl(MindbodySpider)
process.start()
</pre>



<h2 class="wp-block-heading">Results</h2>



<p>I was able to scrape data for 16,000 cities in about half a week. I think I averaged about 100 cities an hour. The larger cities had over a hundred pages, but there were <strong>thousands upon thousands of cities with 5-10 pages</strong>.</p>



<p>What about the data? It’s fairly extensive and could be very useful.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="518" height="788" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png" alt="" class="wp-image-1652324" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-154.png 518w, https://blog.finxter.com/wp-content/uploads/2023/10/image-154-197x300.png 197w" sizes="auto, (max-width: 518px) 100vw, 518px" /></figure>
</div>


<p>Pretty good information related to services offered, location, amenities, total ratings, etc. Looking at the rest of the columns:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="510" height="396" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png" alt="" class="wp-image-1652325" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-155.png 510w, https://blog.finxter.com/wp-content/uploads/2023/10/image-155-300x233.png 300w" sizes="auto, (max-width: 510px) 100vw, 510px" /></figure>
</div>
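
<p>A natural next step is merging the per-city files into a single DataFrame. A sketch with two inline stand-ins for city files (real code would load them with <code>glob</code> and <code>pd.read_csv</code>; the <code>id</code> column used for deduplication is an assumption about the data):</p>

```python
import pandas as pd

# Two inline stand-ins for per-city CSV files; gyms near a border can
# appear in more than one city's file, so we deduplicate on "id".
ny = pd.DataFrame({"id": [1, 2], "name": ["Gym A", "Gym B"]})
nj = pd.DataFrame({"id": [2, 3], "name": ["Gym B", "Gym C"]})

all_gyms = pd.concat([ny, nj], ignore_index=True).drop_duplicates(subset="id")
```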


<h2 class="wp-block-heading">Conclusion</h2>



<p>Uncovering the API proved invaluable. It eliminated the need to craft path selectors for individual data elements, significantly streamlining the process. Moreover, it spared me from devising a Scrapy workaround for the JavaScript-rendered page. Investing time in learning Scrapy was a sound decision, given its superior speed compared to other methods I explored.</p>



<p>Looking ahead, the logical progression is to integrate the data into platforms like Jupyter Notebook, Power BI, or Tableau. Furthermore, storing the data in a database seems apt, especially considering the apparent one-to-many relationships observed in each city, like categories and subcategories.</p>
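
<p>The database idea can be sketched in a few lines with SQLite (the table and column names here are illustrative; the one-to-many pieces such as categories would go into their own tables keyed by gym id):</p>

```python
import sqlite3
import pandas as pd

# Illustrative table; real data would come from the combined gym CSVs.
gyms = pd.DataFrame({"id": [1, 2], "name": ["Gym A", "Gym B"]})

with sqlite3.connect(":memory:") as conn:
    gyms.to_sql("gyms", conn, index=False, if_exists="replace")
    count = conn.execute("SELECT COUNT(*) FROM gyms").fetchone()[0]
```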



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>If you want to become a master web scraper, feel free to check out our academy course with downloadable PDF certificate to showcase your skills to future employers or freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="800" height="341" src="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png" alt="" class="wp-image-1652329" srcset="https://blog.finxter.com/wp-content/uploads/2023/10/image-157.png 800w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-300x128.png 300w, https://blog.finxter.com/wp-content/uploads/2023/10/image-157-768x327.png 768w" sizes="auto, (max-width: 800px) 100vw, 800px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Academy</strong>: <a href="https://academy.finxter.com/university/web-scraping-with-beautifulsoup/">Web Scraping with BeautifulSoup</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-data-from-over-16000-gyms-from-mindbodyonline-com/">How I Scraped Data From Over 16,000 Gyms from MindBodyOnline.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How I Cracked the Top 100 in the Kaggle House Prices Competition</title>
		<link>https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Wed, 07 Jun 2023 09:04:29 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1420178</guid>

					<description><![CDATA[<p>Kaggle is a vibrant online community for data science and machine learning, providing a platform for learning, sharing, and competition. It&#8217;s an invaluable resource for individuals interested in these fields, regardless of their level of experience. The Kaggle House Prices &#8211; Advanced Regression Techniques Competition, in particular, is an excellent starting point for anyone who ... <a title="How I Cracked the Top 100 in the Kaggle House Prices Competition" class="read-more" href="https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/" aria-label="Read more about How I Cracked the Top 100 in the Kaggle House Prices Competition">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/">How I Cracked the Top 100 in the Kaggle House Prices Competition</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Kaggle is a vibrant online community for data science and machine learning, providing a platform for learning, sharing, and competition. It&#8217;s an invaluable resource for individuals interested in these fields, regardless of their level of experience. </p>



<p>The <strong>Kaggle House Prices &#8211; Advanced Regression Techniques Competition</strong>, in particular, is an excellent starting point for anyone who has completed a data science or machine learning course and is eager to gain practical experience. </p>



<p>Participants are tasked with predicting the final price of residential homes in Ames, Iowa, based on 79 explanatory variables describing various aspects of the properties. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="947" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-57.png" alt="" class="wp-image-1420271" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-57.png 947w, https://blog.finxter.com/wp-content/uploads/2023/06/image-57-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-57-768x512.png 768w" sizes="auto, (max-width: 947px) 100vw, 947px" /></figure>
</div>


<p>The variables include a vast array of house attributes such as the type of dwelling, the size of the living area, the number of rooms, the year the house was built, the quality and condition of various features, the neighborhood, and many more. The challenge aims to encourage the application of advanced regression techniques and creative feature engineering to build models that can accurately predict house prices, an important task in real estate analytics.</p>



<p>A couple of years ago, right after finishing an online data science bootcamp, I decided to try my hand at the <strong><em>House Prices</em></strong> competition. I found it equally fun and frustrating. I became obsessed with cracking the top 100 on the leaderboard. After much struggle, I finally made it. The code can be found <a href="https://github.com/finxter/Kaggle-House-Prices-Competition" data-type="URL" data-id="https://github.com/finxter/Kaggle-House-Prices-Competition" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>I thought it would be fun to revisit this challenge and write an article about it. </p>



<p>After dusting off the code, I found it held up pretty well. It put me in the 130s on the public leaderboard. I figured I’d tweak the code a bit, get back in the top 100 and write my article. Unfortunately, I got stuck just below 110 and found myself trapped in the same cycle:</p>



<ol class="wp-block-list">
<li>Try anything and everything I can think of</li>



<li>Review other notebooks and try everything everyone else thought of</li>



<li>Find my current notebook a bloated mess that&#8217;s hard to work with, so I start another one</li>
</ol>



<p>Finally, I found another notebook someone graciously posted <a rel="noreferrer noopener" href="https://www.kaggle.com/code/thegamer7675/midterm-210045452" target="_blank">here</a> that achieved the score I was looking for. It took a while to unpack the code. In doing so, I found some things that worked, but I didn&#8217;t really know how they were derived or why they worked. The biggest difference I found was that this notebook focused much more heavily on feature engineering than mine did.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-58-1024x576.png" alt="" class="wp-image-1420275" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-58-1024x576.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-58-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-58-768x432.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-58.png 1122w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>After much effort, I finally stumbled upon “3 simple tricks” that I found particularly helpful. I was able to grind my way to the score I wanted while feeling like I actually understood what was going on. Here they are:</p>



<ol class="wp-block-list">
<li>Use a <code>sklearn</code> pipeline and a &#8220;train_and_test&#8221; function to organize the code.</li>



<li>Use visualizations and Pandas <code><a href="https://blog.finxter.com/pd-dataframe-groupby-a-simple-illustrated-guide/" data-type="post" data-id="340015" target="_blank" rel="noreferrer noopener">groupby</a></code> queries to brainstorm feature engineering ideas</li>



<li>Use the <code>tpot</code> library to help brainstorm ideas for jazzing up the pipeline and using more advanced models.</li>
</ol>



<p>The full model can be found in this <a href="https://www.kaggle.com/code/onehundreddays/house-prices-top-100-score" target="_blank" rel="noreferrer noopener">Kaggle notebook</a>. But I hope you will take a stab at the competition first, then compare your code to mine. I got a Kaggle score of 0.11229, which at the time of writing is good enough for rank 84 out of 4,742 entries. </p>



<h2 class="wp-block-heading">Getting Started</h2>



<p>The easiest way to get started is to join the competition and create a notebook within Kaggle. From the competition page, click the Code tab, then the New Notebook button.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="233" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-52-1024x233.png" alt="" class="wp-image-1420185" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-52-1024x233.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-52-300x68.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-52-768x175.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-52.png 1128w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>The first cell in the new notebook will already be populated. If you run the cell it will show you where to get the data. You can then <a href="https://blog.finxter.com/how-to-convert-tab-delimited-file-to-csv-in-python/" data-type="post" data-id="563635" target="_blank" rel="noreferrer noopener">load the data</a> into Pandas DataFrames using the given locations.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sample_submission = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv")
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")</pre>



<p>Next, all we have to do is split the data into the standard X and y for the features and target. Then we will be ready to create the pipeline.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X = train.drop('SalePrice', axis=1)
y = train[['SalePrice']].copy()
y = np.log1p(y)
</pre>



<p><code>SalePrice</code> is skewed. A handful of very expensive houses extend the right tail. The log of <code>SalePrice</code> is much closer to a normal distribution. </p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="295" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-50-1024x295.png" alt="" class="wp-image-1420184" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-50-1024x295.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-50-300x86.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-50-768x221.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-50.png 1248w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>
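<p>The effect of the transform can be checked numerically. Here is a small self-contained sketch on synthetic, right-skewed prices (not the competition data) that computes sample skewness before and after applying <code>np.log1p</code>:</p>

```python
import numpy as np

def sample_skew(x):
    # Third standardized moment: roughly 0 for a symmetric distribution
    m = x.mean()
    return ((x - m) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
prices = np.exp(rng.normal(12, 0.4, 5000))  # synthetic right-skewed prices

print(sample_skew(prices) > 0.8)                  # True: strongly right-skewed
print(abs(sample_skew(np.log1p(prices))) < 0.15)  # True: roughly symmetric
```

A skewness near zero after the transform is what makes the log-price a friendlier target for linear models.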


<p>There is an interesting side effect of building a model on the log-transformed target variable. When you use the log of <code>SalePrice</code> as the response variable, the interpretation of the coefficients changes. </p>



<p>Now, a one-unit increase in a predictor variable corresponds to a percentage change in <code>SalePrice</code>, rather than an absolute change. </p>



<p>So, in the log-transformed model, if the coefficient of a predictor variable is 0.01, then a one-unit increase in that predictor is associated with an approximately 1% increase in <code>SalePrice</code>. </p>



<p>This means a coefficient can work just as well for a $60,000 house as for a $600,000 house.</p>
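<p>A quick numeric check of this percentage interpretation (hypothetical prices, not competition data):</p>

```python
import numpy as np

# Adding 0.01 on the log scale multiplies the price by exp(0.01) ~ 1.01,
# i.e. roughly a 1% increase, regardless of the starting price.
for price in (60_000, 600_000):
    new_price = np.expm1(np.log1p(price) + 0.01)
    pct_change = (new_price - price) / price * 100
    print(round(pct_change, 3))  # ~1.005 for both prices
```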



<h2 class="wp-block-heading"><strong>Machine Learning Pipelines: A Key Tool in Model Building</strong></h2>



<p>In the realm of machine learning, Sklearn&#8217;s pipeline is an indispensable tool that simplifies the process of building and evaluating models. It neatly chains together data transformation steps and the machine learning model in a sequence. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-59.png" alt="" class="wp-image-1420278" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-59.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-59-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p>When you fit the pipeline, it seamlessly performs the data transformations before fitting the model with the transformed data.</p>



<p>To demonstrate the usage of pipelines, let&#8217;s consider a task. We have our features stored in a DataFrame X and target values in a variable y. Our goal is to create a pipeline that:</p>



<ul class="wp-block-list">
<li>Imputes null values for numerical data with the median</li>



<li>Imputes null values for text data with the most common value</li>



<li>Scales the numeric data with <code>StandardScaler</code></li>



<li>Uses One Hot Encoding on the text data</li>
</ul>



<p>Here&#8217;s how you can implement it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


# Identify numeric columns
numeric_columns = X.select_dtypes(include=['number']).columns


# Identify categorical columns
categorical_columns = X.select_dtypes(include=['object']).columns


# Create transformers
numeric_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)


categorical_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore')
)


# Combine transformers into a preprocessor step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)
    ]
)


# The preprocessor can now be used in a pipeline with a final estimator
# model = make_pipeline(preprocessor, YourModel())
</pre>



<p>This code has three essential parts:</p>



<ol class="wp-block-list">
<li><strong>Identifying the types of columns</strong>: Numeric columns are handled differently from non-numeric ones. We fill null values for numeric columns with the median and for text data with the most common value.</li>



<li><strong>Creating transformers</strong>: We use the <code>make_pipeline</code> function to create a data transformer for each type of column. The numeric transformer imputes values then scales them, and the categorical transformer fills missing data with the most frequent value, then applies One Hot Encoding to the result.</li>



<li><strong>Combining transformers</strong>: We apply different transformers to different columns using the <code>ColumnTransformer</code>.</li>
</ol>



<p>Next, let&#8217;s package this process into a function, <code>train_and_test</code>, which accepts a machine learning model and a data manipulation function as parameters. This allows us to easily test different models and feature engineering approaches.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def train_and_test(model, X, y, data_func=None):
    X_copy = X.copy()
    y_copy = y.copy()

    # Optionally apply a feature engineering function first
    if data_func:
        data_func(X_copy, y_copy)

    # impute_and_encode() builds the ColumnTransformer preprocessor shown above
    pipe = make_pipeline(
        impute_and_encode(X_copy),
        model
    )

    pipe.fit(X_copy, y_copy)
    evaluate_model(pipe, X, y)
</pre>



<h2 class="wp-block-heading">Evaluating the Model with RMSE</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>RMSE</strong> stands for Root Mean Squared Error. Here&#8217;s how it works: for each data point, the model&#8217;s predicted value is subtracted from the actual value to give the prediction error. Each of these errors is then squared and the results are averaged across all data points. Finally, the square root of this average is taken to give the RMSE.</p>
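<p>The calculation can be sketched in a few lines of NumPy (toy numbers, not model output):</p>

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted            # prediction errors
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then square root
print(round(rmse, 4))  # 0.9354
```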



<p>Because the errors are squared before being averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable. Here is the code to evaluate model performance:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def evaluate_model(model, X, y):
    model.fit(X, y)


    rmse_scores = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    rmse_mean = rmse_scores.mean()


    # Calculate R-squared score using cross validation
    r2_scores = cross_val_score(model, X, y, scoring="r2", cv=5)
    r2_mean = r2_scores.mean()
    print(f'mean RMSE with 5 folds: {rmse_mean}')
    print(f'mean R2: {r2_mean}')
    return rmse_mean, r2_mean</pre>



<p>The basic idea behind cross-validation is to divide the data into a number of subsets, or &#8216;folds&#8217;. </p>



<p>The model is then trained on all but one of these folds and tested on the remaining fold. This process is repeated with each fold serving as the test set once. </p>



<p>This is often referred to as <strong><em>K-fold cross-validation</em></strong>, where K is the number of folds. Cross-validation gives a better measure of how well your model will perform on unseen data than using a single train-test split.</p>
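<p>A small sketch of how the folds are formed, using <code>KFold</code> on ten dummy samples:</p>

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)
kf = KFold(n_splits=5)

# Each sample lands in exactly one test fold across the 5 splits
test_folds = [test_idx for _, test_idx in kf.split(data)]
print(len(test_folds))                     # 5 folds
print(sorted(np.concatenate(test_folds)))  # every index 0..9 appears once
```

<code>cross_val_score</code>, used in <code>evaluate_model</code> above, performs exactly this splitting internally and returns one score per fold.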



<h2 class="wp-block-heading">Unleashing Exploratory Data Analysis for Feature Engineering</h2>



<p><strong>Feature engineering</strong> is a crucial phase in the model-building process where you transform existing features and create new ones with the aim of enhancing model performance. A great starting point for feature engineering is to get acquainted with the existing features through <strong><a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" data-type="post" data-id="335731" target="_blank" rel="noreferrer noopener">Exploratory Data Analysis (EDA)</a></strong>. Let&#8217;s see how this process can lead us to discover some intriguing insights.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-60-1024x575.png" alt="" class="wp-image-1420280" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-60-1024x575.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-60-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-60-768x431.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-60.png 1124w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" data-type="URL" data-id="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" target="_blank" rel="noreferrer noopener">Easy Exploratory Data Analysis (EDA) in Python with Visualization</a></p>



<p>A widely used EDA visualization tool is the <a href="https://blog.finxter.com/how-to-make-heatmap-using-pandas-dataframe/" data-type="post" data-id="61559" target="_blank" rel="noreferrer noopener">heatmap</a>, which provides an overview of feature correlations. Let&#8217;s take a closer look at how our features correlate with &#8216;<code>SalePrice</code>&#8216; – the target feature.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">plt.figure(figsize=(4,10))
# numeric_only=True restricts the correlation matrix to numeric columns
sns.heatmap(train.corr(numeric_only=True)[['SalePrice']], annot=True)
plt.title('Correlations with SalePrice')
plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="480" height="754" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-53.png" alt="" class="wp-image-1420188" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-53.png 480w, https://blog.finxter.com/wp-content/uploads/2023/06/image-53-191x300.png 191w" sizes="auto, (max-width: 480px) 100vw, 480px" /></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/heatmaps-with-seaborn/" data-type="post" data-id="19568" target="_blank" rel="noreferrer noopener">Creating Beautiful Heatmaps with Seaborn</a></p>



<p>A notable anomaly in this heatmap is the feature &#8216;<code>OverallCond</code>&#8216;, which denotes the overall condition of the house on a scale of 1 to 10 (10 being the best). </p>



<p>Intuitively, we&#8217;d expect houses in better condition to fetch higher prices, translating to a strong positive correlation. But surprisingly, &#8216;<code>OverallCond</code>&#8216; demonstrates a meager correlation of -0.037 with &#8216;<code>SalePrice</code>&#8216;.</p>



<p>This presents an exciting puzzle – can we improve the model&#8217;s performance by modifying &#8216;<code>OverallCond</code>&#8216;, crafting a new feature, or simply discarding it? With our pipeline and <code>train_and_test</code> function set up, testing these alternatives is a breeze.</p>



<p>Before we proceed, let&#8217;s visualize &#8216;<code>OverallCond</code>&#8216; vs &#8216;<code>SalePrice</code>&#8216; on a scatter plot:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="529" height="384" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-49.png" alt="" class="wp-image-1420183" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-49.png 529w, https://blog.finxter.com/wp-content/uploads/2023/06/image-49-300x218.png 300w" sizes="auto, (max-width: 529px) 100vw, 529px" /></figure>
</div>


<p>The plot seems to suggest a positive correlation, contradicting the correlation matrix. A peek at the histogram of &#8216;<code>OverallCond</code>&#8216; reveals that the majority of houses have a value of 5.</p>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Let&#8217;s posit a hypothesis &#8211; <strong>Could the age of the house influence how &#8216;<code>OverallCond</code>&#8216; affects &#8216;<code>SalePrice</code>&#8216;?</strong></p>



<p>Let&#8217;s divide our data into older and newer houses (built before and after 1980, respectively) and plot them against &#8216;<code>SalePrice</code>&#8216;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">older_house = X.YearBuilt &lt; 1980
plot = sns.scatterplot(x=X.OverallCond, y=train.SalePrice, hue=older_house)
legend = plot.legend_
legend.set_title("Built before 1980")
plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="539" height="382" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-51.png" alt="" class="wp-image-1420186" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-51.png 539w, https://blog.finxter.com/wp-content/uploads/2023/06/image-51-300x213.png 300w" sizes="auto, (max-width: 539px) 100vw, 539px" /></figure>
</div>


<p>Interesting! It appears that for newer houses, &#8216;<code>OverallCond</code>&#8216; generally receives a default value of 5. For older houses, however, the &#8216;<code>OverallCond</code>&#8216; rating seems to matter more.</p>



<p>To capitalize on this observation, we&#8217;ll create a new feature, &#8216;<code>HouseAge</code>&#8216;, to represent the age of the house, and another, &#8216;<code>AgeCond</code>&#8216;, to capture the interaction between &#8216;<code>HouseAge</code>&#8216; and &#8216;<code>OverallCond</code>&#8216;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def house_age(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond

train_and_test(LinearRegression(), X, y, house_age)
</pre>



<p>Incorporating these changes leads to a reduction in the RMSE from 0.1566 to 0.1562. While most experiments won&#8217;t bear fruit, and the successful ones may bring only minor improvements, persisting with this iterative process will gradually lead you to a well-performing model.</p>



<h2 class="wp-block-heading">Error Residuals for Feature Creation</h2>



<p class="has-global-color-8-background-color has-background"><strong>Error residuals</strong>, often simply called residuals, measure the gap between the actual and predicted values of a data point. In essence, a residual is the part of the outcome your model fails to explain. In linear regression, it&#8217;s calculated as e = y &#8211; ŷ, where &#8216;y&#8217; denotes the observed value and &#8216;ŷ&#8217; represents the predicted value from your model. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-61.png" alt="" class="wp-image-1420281" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-61.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-61-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p>A healthy model ideally has normally distributed and random residuals. By uncovering patterns within these errors, we can pinpoint the model&#8217;s blind spots, fueling us with novel feature creation ideas.</p>



<p>To illuminate this, let&#8217;s first establish a function to predict:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def generate_predictions(model, data_func=None):
    X_copy = X.copy()
    y_copy = y.copy()

    if data_func:
        data_func(X_copy, y_copy)

    # Reuse the same preprocessing step as train_and_test
    pipe = make_pipeline(
        impute_and_encode(X_copy),
        model
    )

    pipe.fit(X_copy, y_copy)
    predictions = pipe.predict(X_copy)
    return predictions
</pre>



<p>With predictions in hand, we calculate and visualize the residuals:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">predicted_prices = generate_predictions(LinearRegression(), house_age)
# Keep residuals as a DataFrame so we can filter on residuals.SalePrice later
residuals = y - predicted_prices.reshape(-1, 1)

plt.plot(range(len(y)), residuals, 'bo', alpha=.5)
plt.title('Error Residuals')
plt.xlabel('House Index')
plt.ylabel('Residual Value')
plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="576" height="408" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-55.png" alt="" class="wp-image-1420190" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-55.png 576w, https://blog.finxter.com/wp-content/uploads/2023/06/image-55-300x213.png 300w" sizes="auto, (max-width: 576px) 100vw, 576px" /></figure>
</div>


<p>The larger negative residuals represent cases where the model significantly overpredicted <code>SalePrice</code>. We can look at these houses and see if we can find some new information that will help the model predict lower prices. We are looking for something negative about these houses that the model didn&#8217;t see.</p>



<p>A quick scan reveals that these unpredictable homes often have an <code>OverallQual</code> rating below 5 and a <code>SaleCondition</code> that is not &#8220;Normal&#8221;.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">train.loc[np.abs(residuals.SalePrice) > 0.4, ['SaleCondition', 'OverallQual', 'SalePrice']]</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="266" height="261" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-48.png" alt="" class="wp-image-1420182"/></figure>
</div>


<p>Utilizing the <code><a href="https://blog.finxter.com/pd-dataframe-groupby-a-simple-illustrated-guide/" data-type="post" data-id="340015" target="_blank" rel="noreferrer noopener">groupby</a></code> function of Pandas, we compare median prices for true versus false conditions, ideally spotting substantial price differences with a reasonable record count for each condition:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">train.groupby((train.OverallQual &lt; 5)).agg(dict(SalePrice=['median', 'count']))</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="188" height="131" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-54.png" alt="" class="wp-image-1420189"/></figure>
</div>


<p>We can easily modify the code to test similar conditions:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fltr = (train.SaleCondition=='Abnorml') &amp; (train.OverallQual &lt; 5)
train.groupby(fltr).agg(dict(SalePrice=['median', 'count']))</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="158" height="113" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-47.png" alt="" class="wp-image-1420181"/></figure>
</div>


<p>Now we can create a new feature and see if it helps the model.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def create_new_features(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond
    X['QuirkyCondition'] = (X.SaleCondition=='Abnorml') &amp; (X.OverallQual &lt; 5)

train_and_test(LinearRegression(), X, y, create_new_features)
</pre>



<p>The results? <strong>A tad better RMSE: mean RMSE with 5 folds: 0.1559</strong>. Another small victory. After every model modification, the residuals change, granting you another opportunity to analyze and iterate.</p>



<h2 class="wp-block-heading">Leveraging Integer Encoding for Categorical Features</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-62.png" alt="" class="wp-image-1420284" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-62.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-62-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p><strong>One Hot encoding</strong> is a popular technique for transforming categorical variables into binary features, especially when there&#8217;s no inherent order in the categories and their count is relatively small. </p>



<p>However, for ordinal features like <code>OverallQual</code>, where the categories follow a natural progression from &#8220;Poor&#8221; to &#8220;Excellent&#8221;, <strong>Integer (or Ordinal) Encoding</strong> would be more appropriate.</p>
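<p>As a side note, scikit-learn also ships an <code>OrdinalEncoder</code> that can apply a known category order directly. A minimal sketch, assuming the Ames quality abbreviations (&#8220;Po&#8221; through &#8220;Ex&#8221;) as the explicit ordering:</p>

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit ordering from worst to best quality (Po=Poor ... Ex=Excellent)
quality = np.array([['Po'], ['Fa'], ['TA'], ['Gd'], ['Ex'], ['TA']])
enc = OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex']])
encoded = enc.fit_transform(quality).ravel()
print(encoded)  # [0. 1. 2. 3. 4. 2.]
```

<p>This works when you already know the order; the target-based approach below learns an ordering from the data instead.</p>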



<p>Here&#8217;s how to perform Integer Encoding on a feature:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def find_category_mappings(df, variable, target):
    # generate a list of the labels, ordered by the median target value
    ordered_labels = df.groupby([variable])[target].median().sort_values().index

    # return the dictionary with mappings
    return {k: i for i, k in enumerate(ordered_labels)}

def integer_encode(df, feature):
    # mappings are always learned from the training set
    mapping = find_category_mappings(train, feature, 'SalePrice')
    df[feature] = df[feature].map(mapping)
</pre>



<p>The above functions rank feature values based on the median <code>SalePrice</code>, replacing them with their respective ranks. Consequently, unordered categorical features morph into meaningful ordinal features.</p>
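<p>To see what the mapping looks like, here is a quick sketch of the same logic on a toy dataset (the labels and prices are made up for illustration):</p>

```python
import pandas as pd

# Hypothetical mini-dataset: basement quality labels and sale prices
df = pd.DataFrame({
    'BsmtQual': ['Fa', 'TA', 'Gd', 'TA', 'Ex', 'Gd'],
    'SalePrice': [100, 140, 180, 150, 300, 200],
})

# Same logic as find_category_mappings: rank labels by median SalePrice
ordered_labels = df.groupby(['BsmtQual'])['SalePrice'].median().sort_values().index
mapping = {k: i for i, k in enumerate(ordered_labels)}
print(mapping)  # {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}
```

<p>The label with the lowest median price gets rank 0, so the encoded feature increases with the typical sale price.</p>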



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def ordinal_encode_features(X):
    integer_encode(X, 'BsmtQual')
    integer_encode(X, 'BsmtCond')
    # ... lots of others omitted for brevity ...
    integer_encode(X, 'GarageQual')
    integer_encode(X, 'GarageCond')</pre>



<p>Ordinal Encoding is particularly useful when a categorical feature has many unique values or when creating interaction terms with that feature. </p>



<p>The &#8216;<code>Neighborhood</code>&#8217; feature is an excellent case in point. A more affluent neighborhood might have distinctive preferences for various features, which we can capture by creating interaction terms, multiplying the integer-encoded &#8216;<code>Neighborhood</code>&#8217; field with those features.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def neighborhood_features(X):
    X['Hood2'] = X['Neighborhood'].values
    integer_encode(X, 'Neighborhood')
   
    # neighborhood interactions
    X['HoodQual'] = X.Neighborhood * X.OverallQual
    X['HoodQual3'] = X.Neighborhood * X.BsmtQual
    # ... [add the other interaction terms here] ...
    X['HoodRooms'] = X.Neighborhood * X.TotRmsAbvGrd
    X['HoodRooms2'] = X.GrLivArea * X.BedroomAbvGr


def data_prep(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond
    X['QuirkyCondition'] = (X.SaleCondition=='Abnorml') &amp; (X.OverallQual &lt; 5)
    ordinal_encode_features(X)
    neighborhood_features(X)


train_and_test(RidgeCV(), data_prep)
</pre>



<p>And the result? <strong>A significant improvement in the RMSE score: mean RMSE with 5 folds: 0.1380.</strong> </p>



<p>Note that we used the <code>RidgeCV</code> model this time. Ridge regression is suitable when your data exhibits multicollinearity (high correlations among predictor variables), and it can help mitigate overfitting. </p>



<p>Attempting the same with <code><a href="https://blog.finxter.com/python-linear-regression-1-liner/" data-type="post" data-id="1920" target="_blank" rel="noreferrer noopener">LinearRegression</a></code> resulted in an unsatisfactory outcome, indicating it&#8217;s time to explore more sophisticated models.</p>
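<p>To illustrate why ridge helps with multicollinearity, here is a small sketch on synthetic data (not the competition data): two nearly identical predictors share one signal, and ridge splits the coefficient between them instead of letting the estimates blow up:</p>

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

model = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
# The two coefficients sum to roughly 3, the true effect of the shared signal
print(model.coef_)
```

<p>Plain least squares in this setting can assign huge offsetting coefficients to the two copies; the ridge penalty keeps them small and stable.</p>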



<h2 class="wp-block-heading">Exploring Advanced Models and Transformers Using TPOT</h2>



<p class="has-global-color-8-background-color has-background"><strong>Tree-based Pipeline Optimization Tool (TPOT)</strong> is a Python library designed to automate the construction and optimization of machine learning pipelines. It uses genetic programming to ease the process of building complex models, especially beneficial for practitioners with limited machine learning expertise.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-63-1024x576.png" alt="" class="wp-image-1420285" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-63-1024x576.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/06/image-63-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-63-768x432.png 768w, https://blog.finxter.com/wp-content/uploads/2023/06/image-63.png 1122w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>TPOT treats the pipeline creation as a search problem, exploring through various data pre-processing steps, feature selection techniques, model selections, and hyperparameter choices, aiming to find the optimal pipeline that maximizes the performance on your dataset.</p>



<p>It&#8217;s worth noting that running TPOT might take some time, but the insights obtained from its suggestions can be valuable. Particularly, it provides initial values for model hyperparameters, which can offer a significant advantage during the hyperparameter tuning process.</p>



<p>The first step is to create a <code>TPOTRegressor</code> object:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from tpot import TPOTRegressor
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)</pre>



<p>The <code>TPOTRegressor</code> is designed specifically for regression tasks. </p>



<p>The <code>generations</code> parameter indicates the number of rounds the algorithm should run to find the best pipeline; a higher number typically implies a slower but potentially more accurate outcome. </p>



<p><code>population_size</code> sets how many pipelines the algorithm keeps in each generation, and <code>verbosity</code> controls the level of output information. </p>



<p>Keep in mind that running TPOT can be time-consuming, especially as it&#8217;s applied across five cross-validation folds in the <code>train_and_test</code> function.</p>



<p>For instance, here is a TPOT recommendation:</p>



<p><strong>Best pipeline:</strong> <code>ExtraTreesRegressor(LassoLarsCV(input_matrix, normalize=False), bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100)</code></p>



<p>Interpreting this, you start from the center and work outward. Thus, TPOT suggests a pipeline comprising two steps:</p>



<ol class="wp-block-list">
<li><code>LassoLarsCV(input_matrix, normalize=False)</code></li>



<li><code>ExtraTreesRegressor(bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100)</code></li>
</ol>



<p>However, there&#8217;s a caveat. </p>



<p>A pipeline can only end with a machine learning model, and all previous steps must be transformers. Hence, not all suggestions directly fit the standard Scikit-Learn pipeline structure. </p>



<p>What if TPOT recommends two machine learning models in its recommended pipeline? </p>



<p>You can stack them. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<h2 class="wp-block-heading">Stacking Machine Learning Models</h2>



<p class="has-global-color-8-background-color has-background">Stacking is a technique where predictions of individual models are used as input for a final model (also known as meta-learner) to make a final prediction. Scikit-Learn offers a <code>StackingRegressor</code> for this purpose.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="377" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-64.png" alt="" class="wp-image-1420289" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-64.png 377w, https://blog.finxter.com/wp-content/uploads/2023/06/image-64-179x300.png 179w" sizes="auto, (max-width: 377px) 100vw, 377px" /></figure>
</div>


<p>To use the <code>StackingRegressor</code>, we first need to initialize the base models and the final model.</p>



<p>Here&#8217;s an example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.ensemble import ExtraTreesRegressor, StackingRegressor
from sklearn.linear_model import LassoLarsCV, LinearRegression

# Initialize the base models
base_models = [
    ('lassolarscv', LassoLarsCV(normalize=False)),
    ('extratrees', ExtraTreesRegressor(bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100))
]


# Initialize the final model
final_model = LinearRegression()


# Create the stacking regressor
stack2 = StackingRegressor(
    estimators=base_models,
    final_estimator=final_model
)


train_and_test(stack2, data_prep)
</pre>



<p>Using this model, the RMSE has dropped to 0.1294, a pretty significant improvement.</p>



<h2 class="wp-block-heading">Adding Scalers and Feature Selectors to the Pipeline</h2>



<p>Machine learning pipelines can incorporate <strong>scalers </strong>and <strong>feature selectors</strong> for improved results. </p>



<ul class="wp-block-list">
<li><strong>Scalers </strong>transform the data to fit within a certain scale like standard deviation or minimum and maximum values, improving the performance of some machine learning models. </li>



<li><strong>Feature selectors</strong>, on the other hand, can be used to reduce the dimensionality of the data by selecting the most important features.</li>
</ul>
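<p>As a quick sketch of a feature selector on toy data (not the competition dataset): <code>VarianceThreshold</code> drops columns whose variance falls below a cutoff.</p>

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The second column is nearly constant, so it carries almost no information
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.01],
              [4.0, 0.0]])

selector = VarianceThreshold(threshold=0.05)
X_sel = selector.fit_transform(X)
print(X_sel.shape)  # (4, 1) -- only the first column survives
```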


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="420" height="631" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-65.png" alt="" class="wp-image-1420292" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-65.png 420w, https://blog.finxter.com/wp-content/uploads/2023/06/image-65-200x300.png 200w" sizes="auto, (max-width: 420px) 100vw, 420px" /></figure>
</div>


<p>Here is a recommendation from <code>tpot</code> that includes a scaler:</p>



<p><strong>Best Pipeline:</strong> <code>XGBRegressor(ElasticNetCV(RobustScaler(input_matrix), l1_ratio=0.1, tol=0.001), learning_rate=0.1, max_depth=9, min_child_weight=6, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.35000000000000003, verbosity=0)</code></p>



<p>And here’s one that recommends a feature selector:</p>



<p><strong>Best pipeline</strong>: <code>RandomForestRegressor(VarianceThreshold(LassoLarsCV(input_matrix, normalize=False), 0.028), bootstrap=False, max_features=0.4, min_samples_leaf=9, min_samples_split=19, n_estimators=100)</code></p>



<p>Let’s try out these ideas.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold

def train_and_test(model, data_func=None):
    # use copies so the original data isn't changed
    X_copy = X.copy()
    y_copy = y.copy()

    if data_func:
        data_func(X_copy, y_copy)

    pipe = make_pipeline(
        get_preprocessor(X_copy),
        RobustScaler(),
        VarianceThreshold(.028),
        model
    )
    pipe.fit(X_copy, y_copy)
    evaluate_model(pipe, X_copy, y_copy)

train_and_test(stack2, data_prep)
</pre>



<p>Another improvement!&nbsp;</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>This blog post has delved into several powerful tools and strategies that I leveraged to improve my ranking in the Kaggle House Prices competition. Here, we revisited:</p>



<ol class="wp-block-list">
<li>The use of Pipelines and a robust &#8220;<code>train_and_test</code>&#8221; function to streamline the model training and evaluation process, fostering cleaner, more manageable code.</li>



<li>The exploration of <strong>Pandas and Seaborn</strong> libraries for brainstorming and creating new features. Data visualization, summary statistics, and feature engineering are crucial in building a comprehensive understanding of your dataset and in finding innovative ways to extract more predictive power from it.</li>



<li>The deployment of <strong>TPOT, a Python Automated Machine Learning</strong> tool that optimizes machine learning pipelines using genetic programming. It&#8217;s a great resource to generate ideas for models, transformers, and pipeline configurations.</li>
</ol>



<p>The key is to foster a productive cycle of idea generation and rapid testing. Ensuring a clean and organized codebase can significantly ease this process. It might be a bit challenging initially, as it was for me, especially when dealing with bloated notebooks that seem impossible to debug or optimize. </p>



<p>However, with perseverance and the right approach, you can turn this into an enjoyable and highly rewarding journey. </p>



<p>Over time, you will find yourself becoming more adept at navigating through these challenges and devising effective solutions, leading to better results and a deeper understanding of machine learning concepts.</p>



<p>Also check out my other article you&#8217;ll probably enjoy:</p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/heatmaps-with-seaborn/" data-type="URL" data-id="https://blog.finxter.com/heatmaps-with-seaborn/" target="_blank" rel="noreferrer noopener">How I Scattered My Fat with Python – Scraping and Analyzing My Nutrition Data From Cronometer.com</a></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://blog.finxter.com/heatmaps-with-seaborn/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="818" height="563" src="https://blog.finxter.com/wp-content/uploads/2023/06/image-56.png" alt="" class="wp-image-1420233" srcset="https://blog.finxter.com/wp-content/uploads/2023/06/image-56.png 818w, https://blog.finxter.com/wp-content/uploads/2023/06/image-56-300x206.png 300w, https://blog.finxter.com/wp-content/uploads/2023/06/image-56-768x529.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></a></figure>
</div><p>The post <a href="https://blog.finxter.com/how-i-cracked-the-top-100-in-the-kaggle-house-prices-competition/">How I Cracked the Top 100 in the Kaggle House Prices Competition</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com</title>
		<link>https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Thu, 23 Mar 2023 09:16:08 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Health]]></category>
		<category><![CDATA[Pandas Library]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1236413</guid>

					<description><![CDATA[<p>From April 1st through August 14th, I tracked everything I ate on cronometer.com as part of a weight loss challenge. Overall I lost almost 25 pounds at a rate of 1.2 pounds per week. I always wondered what I could learn if I could scrape that data and get it into a Jupyter Notebook. In ... <a title="How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com" class="read-more" href="https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/" aria-label="Read more about How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/">How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>From April 1st through August 14th, I tracked everything I ate on <a href="https://cronometer.com/" data-type="URL" data-id="https://cronometer.com/" target="_blank" rel="noreferrer noopener">cronometer.com</a> as part of a weight loss challenge. Overall I lost almost 25 pounds at a rate of 1.2 pounds per week. </p>



<p>I always wondered what I could learn if I could <strong>scrape that data and get it into a Jupyter Notebook</strong>. In this article, I will analyze the data and hopefully demonstrate the value of scraping and analyzing personal data.</p>



<h2 class="wp-block-heading">Why cronometer.com is useful for tracking dietary information</h2>



<p>Cronometer allows you to track your foods, biometric data, exercise, and notes. It will calculate calories and a whole host of nutritional information related to vitamins, minerals, macronutrients, amino acids, etc. It will even allow you to track important nutrient ratios such as Omega-6 to <a href="https://weirdsmoothies.com/pimp-your-smoothie-dr-gregers-daily-dozens/" data-type="URL" data-id="https://weirdsmoothies.com/pimp-your-smoothie-dr-gregers-daily-dozens/" target="_blank" rel="noreferrer noopener">Omega-3</a>, Potassium to Sodium, and Calcium to Magnesium. </p>



<p>Here is a sample of the diary page:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="410" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-250-1024x410.png" alt="" class="wp-image-1236418" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-250-1024x410.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250-300x120.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250-768x308.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250-1536x615.png 1536w, https://blog.finxter.com/wp-content/uploads/2023/03/image-250.png 1600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>A handy summary shows calories consumed, burned, and remaining:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="216" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-251-1024x216.png" alt="" class="wp-image-1236419" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-251-1024x216.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-251-300x63.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-251-768x162.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-251.png 1205w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Calories burned are based on your <strong>Basal Metabolic Rate</strong>, an estimate of calories burned based on your average daily activity level and the exercise you entered. On this day, I had 387 calories remaining, which means I had a calorie deficit of 387, which is a good day if you’re trying to lose weight. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4aa.png" alt="💪" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>The diary also displays a great deal of nutrient information, including vitamins, minerals, protein including amino acids, carbohydrates, and fats.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="736" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-252-1024x736.png" alt="" class="wp-image-1236422" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-252-1024x736.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-252-300x216.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-252-768x552.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-252.png 1177w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>It shows the overall nutrition information for the day as a whole, and for each item in the food diary. A wealth of information is just sitting there, waiting to be harvested.</p>



<h2 class="wp-block-heading">Tools used to scrape the data</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="907" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-271.png" alt="" class="wp-image-1236523" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-271.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-271-300x300.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-271-150x150.png 150w, https://blog.finxter.com/wp-content/uploads/2023/03/image-271-768x768.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>To scrape data from an interactive site like Cronometer, you need a tool that can automate interacting with the site. </p>



<p>The tool I used for automation was <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-get-the-text-with-selenium-in-python/" data-type="post" data-id="36873" target="_blank">Selenium</a>. </p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Selenium</strong> was great for logging in, navigating the calendar to move from day to day, and right-clicking items in the food diary to get to the detailed information. However, I used the <code><a href="https://blog.finxter.com/how-to-read-html-tables-with-pandas/" data-type="post" data-id="590893">read_html()</a></code> function from the Pandas module to extract the data from the web page. </p>



<p>Pandas was also the main tool for the data analysis with some graphs in Seaborn. The full code can be found on the GitHub page <a rel="noreferrer noopener" href="https://github.com/finxter/DietChallenge" data-type="URL" data-id="https://github.com/finxter/DietChallenge" target="_blank">here</a>.</p>
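<p>For a feel of how <code>read_html()</code> works, here is a tiny self-contained round trip (no live site involved): write a DataFrame out as an HTML table, then parse it back. The nutrient values are made up, and <code>read_html()</code> needs an HTML parser such as <code>lxml</code> installed.</p>

```python
import pandas as pd
from io import StringIO

# Write a small DataFrame out as an HTML table...
df = pd.DataFrame({'Nutrient': ['Protein', 'Fiber'], 'Amount': [120, 35]})
html_text = df.to_html(index=False)

# ...then parse it back: read_html returns one DataFrame per table found
tables = pd.read_html(StringIO(html_text))
print(tables[0].shape)  # (2, 2)
```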



<h2 class="wp-block-heading">Working with Selenium</h2>



<p>First the imports.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys</pre>



<p>That may seem like a lot of imports, but they are all necessary. The central object is the web driver, which opens a browser of your choice and automates it. A nice side effect is that you can watch the browser while the code is running, and inspect it afterward. I chose Firefox simply because I found it the easiest to work with.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">URL = 'https://cronometer.com/login/'

def get_driver(url):
    driver = webdriver.Firefox()
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(5)
    set_viewport_size(driver, 1920, 3200)

    return driver</pre>



<p>Let’s look at <code>driver.implicitly_wait(5)</code>.  </p>



<p>The <code>implicitly_wait</code> function is used to set a default time for the driver to wait before throwing a <code>NoSuchElementException</code>. </p>



<p>Modern websites rely on code that runs before all the elements are loaded. If your Selenium code gets ahead of the code behind the web page, you can be hit with the <code>NoSuchElementException</code>. This default waiting time helps avoid that problem. However, there will be times when we want to use explicit waits as well.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="787" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-272.png" alt="" class="wp-image-1236526" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-272.png 787w, https://blog.finxter.com/wp-content/uploads/2023/03/image-272-252x300.png 252w, https://blog.finxter.com/wp-content/uploads/2023/03/image-272-768x915.png 768w" sizes="auto, (max-width: 787px) 100vw, 787px" /></figure>
</div>


<p>Now a few words about the <code>set_viewport_size</code> function, but first I will take a deep breath and spend a few moments in my happy place. </p>



<p>The viewport refers to the visible area of the web page in your browser. So if you try to interact with an element that is not in the viewport, you will get an error. </p>



<p>My first attempt to resolve this was to scroll to each element and move to it before trying to interact with it. And this worked, most of the time. But it would occasionally fail, on a different element each time. Very frustrating! </p>



<p>But eventually, I discovered that you can set the size of the viewport. By setting the size large enough, the problem was resolved. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def set_viewport_size(driver, width, height):
    window_size = driver.execute_script("""
        return [window.outerWidth - window.innerWidth + arguments[0],
          window.outerHeight - window.innerHeight + arguments[1]];
        """, width, height)
    driver.set_window_size(*window_size)


set_viewport_size(driver, 1920, 3200)</pre>



<p>Notice that with <code>driver.execute_script</code> we can run <a href="https://blog.finxter.com/javascript-developer-income-and-opportunity/" data-type="post" data-id="191233" target="_blank" rel="noreferrer noopener">Javascript</a> on the browser. This can be very useful. </p>



<h2 class="wp-block-heading">Logging in to a web site with Selenium</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def log_in(driver):
    user_name = driver.find_element(By.NAME, 'username')
    password = driver.find_element(By.NAME, 'password')
    login = driver.find_element(By.ID, 'login-button')


    user_name.send_keys('email@email.com')
    password.send_keys('***********************')
    login.click()


    # go to the diary page
    click_button(driver, DIARY_XPATH)
</pre>



<p>The <code>By</code> object is used to tell the driver how to find the element you want. </p>



<p>If you are lucky, the element can be uniquely defined by a name or an id as in this case. Filling in a form element is easy. You can just use the <code>element.send_keys</code> method.</p>



<p>Clicking the login button was a bit more complicated, because I needed an explicit wait to make extra sure the element was present before trying to click it.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">DIARY_XPATH = '//span[contains(text(), "Diary")]'

def click_button(driver, button_xpath):
    try:
        button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, button_xpath)))
    except Exception as e:
        print('error trying to click button', button_xpath)
        print(e)
        raise  # without the element there is nothing to click

    webdriver.ActionChains(driver).move_to_element(button).click(button).perform()
</pre>



<p>The <code>ActionChains</code> object allows you to chain multiple actions to an element in one statement. In this case, I move to the element before clicking it.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="607" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-273.png" alt="" class="wp-image-1236529" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-273.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-273-300x201.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-273-768x514.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>What is an <code>XPATH</code>? It’s a web scraper&#8217;s best friend and worst nightmare. From ChatGPT:</p>



<p><em>XPath is a query language used to traverse XML and HTML documents. In Selenium, XPath can be used to identify elements on a webpage by navigating the document&#8217;s hierarchy of nodes.</em></p>



<p><em>XPath is based on a set of rules for traversing the document tree. The tree consists of nodes, which can be either elements, attributes, text, or comments. XPath expressions are used to select nodes or sets of nodes in the tree, based on their relationship to other nodes.</em></p>



<p>In our example <code>//span[contains(text(), 'Diary')]</code> can be unpacked:</p>



<ul class="wp-block-list">
<li><code>//span</code> returns all span elements in the document, regardless of location</li>



<li>Brackets are used to filter elements</li>



<li>The <code>text()</code> function returns the text associated with the element</li>



<li>The <code>contains(text(), 'Diary')</code> expression means look for <code>'Diary'</code> anywhere within the text</li>



<li>Putting it all together, <code>//span[contains(text(), 'Diary')]</code> means give me all span elements that have <code>'Diary'</code> anywhere within their text. Luckily, in this case, there is only one such element</li>
</ul>
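<p>You can also test an XPath expression outside the browser. A minimal sketch with <code>lxml</code> (assuming it is installed), using a made-up HTML snippet:</p>

```python
from lxml import html

doc = html.fromstring("""
<div>
  <span>Settings</span>
  <span>Food Diary</span>
</div>
""")

# Same expression style as above: spans whose text contains 'Diary'
matches = doc.xpath("//span[contains(text(), 'Diary')]")
print([m.text for m in matches])  # ['Food Diary']
```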



<p>In our example, the <code>XPATH</code> is pretty short and identifies only the desired element. So how can <code>XPATH</code> become a nightmare? It did when I tried to create an <code>XPATH</code> to identify only the vitamin elements on a page. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="581" height="451" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-253.png" alt="" class="wp-image-1236432" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-253.png 581w, https://blog.finxter.com/wp-content/uploads/2023/03/image-253-300x233.png 300w" sizes="auto, (max-width: 581px) 100vw, 581px" /></figure>
</div>


<p>Here the <code>XPATH</code> quickly becomes complicated. And I was able to find an expression that effectively filtered only the vitamins for one particular record. However, after running the web scraping process, which takes quite a long time, I found a few records where the data was just wrong. </p>



<p>If you right-click on the web page and choose to inspect, it will bring up the developer tools window. Then you can hit <code>control-f</code> to bring up a search box. This is how you can test your <code>XPATH</code> to see what it returns. </p>



<p>For example:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="594" height="225" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-254.png" alt="" class="wp-image-1236433" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-254.png 594w, https://blog.finxter.com/wp-content/uploads/2023/03/image-254-300x114.png 300w" sizes="auto, (max-width: 594px) 100vw, 594px" /></figure>
</div>


<p>Here I am searching for all HTML elements in the DOM. </p>



<p>Why do I get back 5 elements? Shouldn’t there be just one? It turns out there are entire HTML documents embedded within the DOM. And their data doesn’t necessarily match what you see on the screen. </p>



<p>And sometimes the <code>XPATH</code> expression was pulling data that didn’t match what was displayed. This means the data was wrong. </p>



<p>Often these documents were embedded within iFrame elements.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="791" height="82" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-255.png" alt="" class="wp-image-1236436" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-255.png 791w, https://blog.finxter.com/wp-content/uploads/2023/03/image-255-300x31.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-255-768x80.png 768w" sizes="auto, (max-width: 791px) 100vw, 791px" /></figure>
</div>


<p>I tried filtering out the iFrames, but nothing I did worked 100% of the time. So how did I end up scraping the actual data? With my old friend Pandas.</p>



<h2 class="wp-block-heading">Scraping data with Pandas</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-274.png" alt="" class="wp-image-1236530" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-274.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-274-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-274-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>Pandas has a <code>read_html</code> method that is very powerful and simple to use. All you have to do is feed it <code>driver.page_source</code> and it returns a list of DataFrames. This is very convenient because DataFrames are what I used for data cleaning and data analysis. </p>



<p>The <code><a rel="noreferrer noopener" href="https://blog.finxter.com/reading-and-writing-html-with-pandas/" data-type="post" data-id="37082" target="_blank">read_html()</a></code> method searches for data in tables and is smart enough to only give you the desired data. Fortunately, all the data I need is stored in tables. </p>



<p>For example, on the diary page, the daily USRDA data is stored in 6 tables under the headers: </p>



<ul class="wp-block-list">
<li>General, </li>



<li>Carbohydrates, </li>



<li>Lipids, </li>



<li>Protein, </li>



<li>Vitamins and </li>



<li>Minerals.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="725" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-256-1024x725.png" alt="" class="wp-image-1236442" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-256-1024x725.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/03/image-256-300x213.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-256-768x544.png 768w, https://blog.finxter.com/wp-content/uploads/2023/03/image-256.png 1180w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>First step is to get the list of DataFrames:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">tables = pd.read_html(driver.page_source)
print(f'{len(tables)} tables found')
print('shapes: ', end='')
for i in range(len(tables)):
    print(tables[i].shape, end=' ')
</pre>



<p>Output:</p>



<pre class="wp-block-preformatted"><code>10 tables found
shapes: (26, 8) (5, 4) (6, 4) (9, 4) (13, 4) (13, 4) (11, 4) (10, 7) (1, 5) (7, 7)</code></pre>






<p>The data we want is in the tables at indices 1 through 6. So we just need to <a href="https://blog.finxter.com/how-does-pandas-concat-work/" data-type="post" data-id="17172" target="_blank" rel="noreferrer noopener">concatenate the tables</a> and filter out the data we don’t want.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">nutrients = pd.concat(tables[1:7])   # the tables at indices 1-6 hold the nutrient data
nutrients.columns = ['item', 'quantity', 'units', 'percent_rda']
nutrients = nutrients.dropna()
nutrients = nutrients[nutrients.percent_rda.str.contains('%')]
nutrients.head()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="282" height="168" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-257.png" alt="" class="wp-image-1236447"/></figure>
</div>


<p>By default <code>pd.concat</code> stacks DataFrames vertically. The <code><a href="https://blog.finxter.com/pandas-dataframe-dropna-method/" data-type="post" data-id="343814" target="_blank" rel="noreferrer noopener">dropna()</a></code> method removes rows that have empty values. </p>



<p>The next line uses <a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-dataframe-indexing/" data-type="post" data-id="64801" target="_blank">boolean indexing</a> to filter the nutrients DataFrame to include rows where the value in the <code>percent_rda</code> column contains a <code>%</code>. This filters out nutrients like alcohol where there is no RDA.</p>
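<p>Here is a minimal reproduction of that cleaning pipeline on a couple of hand-made tables (the values are invented, purely for illustration):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

# Two toy tables standing in for the per-section nutrient tables
t1 = pd.DataFrame({'item': ['Vitamin C', 'Alcohol'],
                   'quantity': [82.0, 0.0],
                   'units': ['mg', 'g'],
                   'percent_rda': ['91%', 'No Targets']})
t2 = pd.DataFrame({'item': ['Iron', None],
                   'quantity': [9.5, None],
                   'units': ['mg', None],
                   'percent_rda': ['53%', None]})

nutrients = pd.concat([t1, t2])    # stack vertically (the default axis=0)
nutrients = nutrients.dropna()     # drop rows with missing values
nutrients = nutrients[nutrients.percent_rda.str.contains('%')]  # keep rows that have an RDA
print(nutrients['item'].tolist())
</pre>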



<p>Pandas is such a powerful and versatile tool for working with data in Python. So I was delighted to find out it can also scrape data. </p>



<p>However, I would like to find something to handle the automation that is a little simpler to work with than Selenium. It does get the job done; perhaps I just need more experience.</p>



<h2 class="wp-block-heading">Right-clicking with Selenium</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="625" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-275.png" alt="" class="wp-image-1236534" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-275.png 625w, https://blog.finxter.com/wp-content/uploads/2023/03/image-275-200x300.png 200w" sizes="auto, (max-width: 625px) 100vw, 625px" /></figure>
</div>


<p>The main diary page has nutrient information for the day as a whole, but you can get nutrient information for each item in the food diary by right-clicking the item and choosing ‘details’ in the pop-up menu.</p>



<p>The first step is to find a way to access the food diary rows directly. For that we return to our old friend/nemesis the <code>XPATH</code>. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">FOOD_DIARY_XPATH = "//table[@class='crono-table']//td[@class='diary-time']/parent::tr"</pre>



<p>Unpacking the expression:</p>



<ul class="wp-block-list">
<li><code>//table</code> means give me all tables anywhere in the document</li>



<li><code>[@class='crono-table']</code> means of those tables only give me the ones that contain the class <code>'crono-table'</code></li>



<li><code>//td[@class='diary-time']</code> means give me <code>td</code> elements that fall anywhere under the tables we got from the previous step, but only if they contain the class <code>diary-time</code></li>



<li><code>/parent::tr</code> means: <em>Ok, now let&#8217;s go up one level to the parent but only if it is a <code>tr</code> element</em>.</li>
</ul>



<p>So we can see the <code>XPATH</code> can pack a great deal of filtering logic into one dense compact statement. It’s a lot like <a href="https://blog.finxter.com/python-regex/" data-type="post" data-id="6210" target="_blank" rel="noreferrer noopener">regular expressions</a> in that regard.</p>



<p>Likewise, we need an <code>XPATH</code> expression for the details row in the pop-up menu:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">VIEW_EDIT_XPATH = "//*[contains(text(), 'View/Edit')]"</pre>



<p>Here the <a rel="noreferrer noopener" href="https://blog.finxter.com/python-regex-quantifiers-question-mark-vs-plus-vs-asterisk-differences/" data-type="post" data-id="6915" target="_blank">asterisk</a> <code>*</code> is a wildcard. So this expression gives us any element that contains the text “View/Edit”. </p>



<p>Here is the code to get all the food diary elements into a <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">wait = WebDriverWait(driver, 20)
diary = []
diary_elements = wait.until(EC.visibility_of_all_elements_located((By.XPATH, FOOD_DIARY_XPATH)))
diary_elements = [wait.until(EC.element_to_be_clickable(e)) for e in diary_elements]
</pre>



<p><code>WebDriverWait</code> defines an explicit wait, meaning it waits until a condition is met rather than for a fixed amount of time. </p>



<p>We told it to wait a maximum of 20 seconds for this condition to be met. </p>



<p>The first condition we look for is that all elements can be located by Selenium. If you don’t wait, your code will sometimes get ahead of the page the driver is trying to load, and you will get an error.</p>



<p>With the last line of code, I am using a <a rel="noreferrer noopener" href="https://blog.finxter.com/list-comprehension/" data-type="post" data-id="1171" target="_blank">list comprehension</a> to make sure each diary element is actually clickable before the element is added to the final list. It is possible for an element to be visible but not yet clickable. This will lead to an error when we try to right-click the element.</p>
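<p>Under the hood, an explicit wait is just a polling loop. Here is a simplified, Selenium-free sketch of the idea (this helper is my approximation, not Selenium’s actual implementation):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import time

def wait_until(condition, timeout=20, poll_frequency=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires.

    A rough stand-in for what WebDriverWait(driver, 20).until(...) does.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError('condition not met within timeout')
        time.sleep(poll_frequency)

# A condition that only becomes truthy on the third poll
state = {'polls': 0}
def ready():
    state['polls'] += 1
    return state['polls'] if state['polls'] >= 3 else None

value = wait_until(ready, timeout=5, poll_frequency=0.01)
print(value)   # 3
</pre>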



<h2 class="wp-block-heading">Working with the calendar in cronometer</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="625" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-276.png" alt="" class="wp-image-1236535" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-276.png 625w, https://blog.finxter.com/wp-content/uploads/2023/03/image-276-200x300.png 200w" sizes="auto, (max-width: 625px) 100vw, 625px" /></figure>
</div>


<p>This was a fun puzzle to solve: how do you get to April 1, 2021 from today, using controls that jump back a year, step back or forward a month, and then locate the first day of the month on the calendar? </p>



<p>Here is what the calendar looks like:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="328" height="456" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-258.png" alt="" class="wp-image-1236453" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-258.png 328w, https://blog.finxter.com/wp-content/uploads/2023/03/image-258-216x300.png 216w" sizes="auto, (max-width: 328px) 100vw, 328px" /></figure>
</div>


<p>The first step is to get to the right year and month:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">last_year_xpath = "//div[contains(text(), '«')]"
next_month_xpath = "//div[contains(text(), '›')]"
last_month_xpath = "//div[contains(text(), '‹')]"


target_date = datetime.strptime(target_date, '%Y-%m-%d')
today = datetime.today()


last_year_button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, last_year_xpath)))
next_month_button = driver.find_element(By.XPATH, next_month_xpath)
last_month_button = driver.find_element(By.XPATH, last_month_xpath)


for _ in range(today.year - target_date.year):  
    ac = webdriver.ActionChains(driver)
    ac.move_to_element(last_year_button).click(last_year_button).perform()
    time.sleep(2)


if target_date.month > today.month:
    for _ in range(target_date.month - today.month):
        next_month_button.click()
        time.sleep(2)
else:
    for _ in range(today.month - target_date.month):
        last_month_button.click()
        time.sleep(2)
</pre>
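<p>The number of clicks can be worked out up front with a little datetime arithmetic. This helper is hypothetical (it wasn’t part of my script), but it captures the navigation logic of the loops above:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from datetime import datetime

def calendar_clicks(today, target):
    """How many «-clicks (years back) and month clicks to reach the target.

    Returns (years_back, month_delta); a positive month_delta means 'next
    month' clicks, a negative one means 'last month' clicks.
    """
    years_back = today.year - target.year
    month_delta = target.month - today.month
    return years_back, month_delta

# From October 2023 back to April 2021: 2 year-clicks, then 6 'last month' clicks
print(calendar_clicks(datetime(2023, 10, 19), datetime(2021, 4, 1)))
</pre>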



<p>Next, find the control for the day. </p>



<p>There are always 42 day cells on the calendar. For this particular month, that works out to 3 from the previous month, 31 for the current month, and 8 from the next month. </p>



<p>We want the calendar element with the text “1”, but only the first one. The day controls all have a unique id starting at 100. The problem is the id for the first day of the month can vary.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">first_day_id = 99   # day control ids start at 100; the loop increments first
first_day_text = ''
while first_day_text != '1':
    first_day_id += 1
    first_day_css = f"td#calendar-date-{first_day_id}"
    first_day_div = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, first_day_css))
    )
    first_day_text = first_day_div.text


first_day_div.click()
</pre>



<p>Then after scraping data for day 1, I just have to click the <code>tomorrow</code> button on the calendar and do it again until I finally reach August 15, 2021.</p>



<p>Selenium was a bit frustrating until I got the hang of it. However, once I increased the viewport size and used explicit waits, it got the job done for website automation. The <code>read_html</code> function from Pandas turned out to be a lifesaver for doing the actual scraping of the data. </p>



<h2 class="wp-block-heading">Data Analysis</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-277.png" alt="" class="wp-image-1236540" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-277.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-277-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-277-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>Now for the fun part. After spending so much time scraping the data, it&#8217;s time to dive into some analysis! </p>



<p>Overall I lost .17 pounds per day with a standard deviation of .91 pounds. This lasted for 131 days for a total of 24.2 pounds lost. </p>



<p>Here is a <a href="https://blog.finxter.com/matplotlib-scatter-plot/" data-type="post" data-id="5590" target="_blank" rel="noreferrer noopener">scatter plot</a> of Weight vs Day of Challenge including a regression line:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="818" height="563" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-259.png" alt="" class="wp-image-1236456" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-259.png 818w, https://blog.finxter.com/wp-content/uploads/2023/03/image-259-300x206.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-259-768x529.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></figure>
</div>


<p>Wow, that is surprisingly linear! I always thought weight loss was supposed to be fast initially, then taper off. </p>



<p>The R-squared value of .98 is very high. R-squared measures how well the regression line fits the data. Values range between 0 and 1. </p>



<p>An R-squared of 0 would indicate the regression line doesn’t fit the data at all. An R-squared of 1 indicates the regression line fits the data perfectly. </p>
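<p>For intuition, R-squared is just 1 minus the ratio of residual variance to total variance. Here is a tiny pure-Python version (no sklearn needed) on made-up numbers:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot, matching sklearn's r2_score."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# A perfect fit scores 1.0; predicting the mean everywhere scores 0.0
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0
</pre>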



<p>Another interpretation is 98% of the variation of weight can be explained by the day on the program. In other words, the plan worked like a charm! Slow and steady wins the race.</p>



<p>Here is the code for the graph above. I used the <code>LinearRegression</code> class from the <code><a href="https://blog.finxter.com/python-linear-regression-1-liner/" data-type="post" data-id="1920" target="_blank" rel="noreferrer noopener">sklearn</a></code> module to create the regression line. Unfortunately, to get <code>LinearRegression</code> to work for simple regression with only one feature we have to reshape the data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns

def plot_regression(data, feature, target, title):
  # sklearn expects a 2d matrix so we have to reshape pandas series
  # an array of size n is reshaped into a matrix with n rows and 1 column
  y = data[target].values.reshape(-1, 1)
  X = data[feature].values.reshape(-1, 1)
  model = LinearRegression()
  model.fit(X, y)


  # get slope and intercept from model
  slope = model.coef_[0][0]
  intercept = model.intercept_[0]


  # use slope and intercept to create predictions
  weight_pred = intercept + slope * X.reshape(-1)


  # use R2 score to compare predictions to true values
  r2 = r2_score(data[target], weight_pred)


  # plot
  plt.figure(figsize=(12,8))
  sns.scatterplot(x=feature, y=target, data=data)
  plt.plot(X.reshape(-1), weight_pred, linewidth=1, color='r', label=f'y={slope:.2f} * x + {intercept:.1f}')


  # add a second row to the title to display R2
  plt.title(title + f'\nr2 = {r2:.2f} ')
</pre>



<p>The fact that the mean daily weight loss is only .17 pounds with a relatively large standard deviation of .91 pounds leads to some short-term results that can be quite frustrating. </p>



<p>For example, here is a two-week stretch where it felt like nothing was working:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="831" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-260.png" alt="" class="wp-image-1236461" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-260.png 831w, https://blog.finxter.com/wp-content/uploads/2023/03/image-260-300x195.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-260-768x500.png 768w" sizes="auto, (max-width: 831px) 100vw, 831px" /></figure>
</div>


<p>For comparison, here is a two-week stretch where everything seemed easy:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="831" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-261.png" alt="" class="wp-image-1236463" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-261.png 831w, https://blog.finxter.com/wp-content/uploads/2023/03/image-261-300x195.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-261-768x500.png 768w" sizes="auto, (max-width: 831px) 100vw, 831px" /></figure>
</div>


<p>So slow and steady may win the race, but it can often feel like losing. The trick is to have faith in the plan and keep on truckin&#8217;.</p>



<p>We can use a histogram to look at the distribution of weekly weight loss amounts:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="809" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-262.png" alt="" class="wp-image-1236464" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-262.png 809w, https://blog.finxter.com/wp-content/uploads/2023/03/image-262-300x201.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-262-768x514.png 768w" sizes="auto, (max-width: 809px) 100vw, 809px" /></figure>
</div>


<p>More good weeks than bad, and the best week dominates the worst week in absolute value: 3.5 pounds lost vs 1.5 pounds gained. There were enough positive results to stay motivated.</p>



<p>What if I repeated this challenge many times? What would the range of values for average weekly weight loss look like? </p>



<p>I can’t very well replicate the experiment 1,000 times, but I can estimate a 95% confidence interval using the bootstrap method. </p>



<p>This uses resampling with replacement to generate hypothetical samples, which can be used to create a confidence interval. Because we are resampling with replacement, some values can occur more than once in a given sample and others not at all. </p>



<p>This means we can generate samples from our data that are different from each other but still pulled from the same original data.</p>
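<p>The bootstrap itself is only a few lines: resample with replacement many times, take the mean of each resample, and read the interval off the percentiles. Here is a sketch using only the standard library, with invented weekly-loss numbers and a fixed seed:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import random
import statistics

def bootstrap_ci(data, n_resamples=1000, alpha=0.05, seed=42):
    """95% bootstrap confidence interval for the mean (percentile method)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))  # resample WITH replacement
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Invented weekly weight-loss values, just to exercise the function
weekly_loss = [1.4, 0.8, 1.9, 1.1, -0.3, 2.2, 1.3, 0.9, 1.6, 0.5]
low, high = bootstrap_ci(weekly_loss)
print(round(low, 2), round(high, 2))
</pre>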


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="818" height="541" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-263.png" alt="" class="wp-image-1236465" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-263.png 818w, https://blog.finxter.com/wp-content/uploads/2023/03/image-263-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-263-768x508.png 768w" sizes="auto, (max-width: 818px) 100vw, 818px" /></figure>
</div>


<p>Assuming the factors behind my current data hold, the bootstrap gives a 95% confidence interval suggesting that, if I replicated this experiment, I would lose somewhere from a pound to almost a pound and a half a week. </p>



<p>This also matches my previous experience. In previous weight loss challenges, I lost weight at a little over a pound a week. The fancy bootstrap method just makes it official. </p>



<h2 class="wp-block-heading">Looking at total calories over time</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-278.png" alt="" class="wp-image-1236541" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-278.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-278-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-278-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>My daily goal was to hit a caloric deficit of at least 200 calories. Luckily cronometer will help you calculate an estimate of the number of calories you burn on a typical day. </p>



<p>It will estimate your <strong>Basal Metabolic Rate</strong> and how many calories you burn each day through activity. For me, the total number is 2218 calories per day. </p>



<p>If I eat this amount, I should maintain my weight. If I consistently eat less, I should lose weight. 2000 was a good round number to try and hit each day. So how did I do?</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="827" height="563" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-264.png" alt="" class="wp-image-1236467" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-264.png 827w, https://blog.finxter.com/wp-content/uploads/2023/03/image-264-300x204.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-264-768x523.png 768w" sizes="auto, (max-width: 827px) 100vw, 827px" /></figure>
</div>


<p>I struggled to hit my daily target early in the challenge. This may explain why I didn’t experience more rapid weight loss at the start. Luckily most days were below the break-even point of 2218 calories so I still lost weight.</p>



<p>After day 50, I hit the target most days. This shows I got better at eating fewer calories over time. Overall, the total calories were not consistent at all, but they didn’t need to be. What seems to matter is the long-run average.</p>



<p>In hindsight, 2000 calories is still a good target even though I can’t expect to hit it every day. By setting a mildly ambitious target, I set up a situation where I can fail a little bit and still be Ok. </p>



<h2 class="wp-block-heading">Correlations</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="926" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-279.png" alt="" class="wp-image-1236543" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-279.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-279-294x300.png 294w, https://blog.finxter.com/wp-content/uploads/2023/03/image-279-768x784.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>We know that days correlate very highly with weight, but what about calories? What other interesting correlations might we find?</p>



<p>We can use a correlation heat map to find out. For calories, I added some calculated fields to make it interesting.</p>



<ul class="wp-block-list">
<li><code>yesterday_total_calories</code> &#8211; total calories offset one day in the past</li>



<li><code>total_calories_7dma</code> &#8211; average calories for the previous 7 days</li>



<li><code>total_calories_14dma</code> &#8211; average calories for the previous 14 days</li>



<li><code>total_calories_21dma</code> &#8211; average calories for the previous 21 days</li>
</ul>



<p>The reason for adding the moving averages is to smooth out the day-to-day variation. </p>
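<p>Those calculated fields can be built with <code>shift()</code> and <code>rolling()</code>. Here is a sketch on a toy calorie series (the values are invented):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

# Toy daily-calorie series, purely for illustration
df = pd.DataFrame({'total_calories': [2100, 1900, 2300, 2000, 1800, 2200, 1950, 2050]})

# Offset one day into the past
df['yesterday_total_calories'] = df.total_calories.shift(1)

# Average of the previous 7 days (shift first so today is excluded)
df['total_calories_7dma'] = df.total_calories.shift(1).rolling(7).mean()

print(df.tail(1))
</pre>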



<p>Here is the code to create the <a href="https://blog.finxter.com/how-to-make-heatmap-using-pandas-dataframe/" data-type="post" data-id="61559" target="_blank" rel="noreferrer noopener">heatmap</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np  # needed for the triangle mask below

def correlation_heatmap(df, title):
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Draw the heatmap with the mask
    sns.heatmap(corr, mask=mask, cmap='BuPu', center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5},
                annot=True)
    plt.title(title)
    plt.show()
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="806" height="734" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-265.png" alt="" class="wp-image-1236472" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-265.png 806w, https://blog.finxter.com/wp-content/uploads/2023/03/image-265-300x273.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-265-768x699.png 768w" sizes="auto, (max-width: 806px) 100vw, 806px" /></figure>
</div>


<p>As expected, the longer the time frame for the moving average, the higher the correlation between past calorie consumption and the current day&#8217;s weight. </p>



<p>Does this mean that what I ate 14 days ago affects my weight today? </p>



<p>I don’t think so. I think a lot of things, such as hydration levels, can affect your weight at any given point in time. But that averages out, in the long run, leaving total calories as the dominating factor determining body weight.</p>



<h2 class="wp-block-heading">Good Days, Bad Days</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-280.png" alt="" class="wp-image-1236544" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-280.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-280-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-280-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>You know I’ve had my share. What did I eat on bad days vs good days? </p>



<p>I defined a bad day as any day I had a caloric surplus > 100 calories. It turns out I had 20 bad days, that’s 15% of the days in the challenge. That’s a lot more than I remember. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="779" height="286" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-266.png" alt="" class="wp-image-1236473" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-266.png 779w, https://blog.finxter.com/wp-content/uploads/2023/03/image-266-300x110.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-266-768x282.png 768w" sizes="auto, (max-width: 779px) 100vw, 779px" /></figure>
</div>


<p>Damn you sourdough, damn you straight to hell! Why do you have to taste so good? I don’t miss the other foods I’ve given up, like frozen pizza, chips, cookies, soda, and ice cream. But do I have to give up that fluffy slice of heaven known as sourdough bread? Apparently so. They say you can lose weight without giving up the foods you love. They lie. As the immortal Jack LaLanne once said, “If it tastes good, spit it out!”</p>



<p>Why would decaf coffee show up on this list? I used a high-calorie creamer and drank extra cups on bad days. And it’s also something I drank pretty much every day.</p>



<p>For comparison, let’s look at the top calorie sources on good days, defined as any day with a calorie deficit > 100 calories.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="704" height="286" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-267.png" alt="" class="wp-image-1236474" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-267.png 704w, https://blog.finxter.com/wp-content/uploads/2023/03/image-267-300x122.png 300w" sizes="auto, (max-width: 704px) 100vw, 704px" /></figure>
</div>


<p>Boiled potatoes, quinoa, tofu, bananas, and sardines. Doesn’t sound very appetizing, does it? </p>



<p>Apparently that’s why they work as weight-loss foods. Oh well, at least I have beer. It is a matter of pride that I could have one beer a day and still lose weight. I really looked forward to that beer every day. The sardines, not so much.</p>



<p>Why does tofu work well as a weight-loss food? It’s high in protein, and it sits in your stomach like a brick. And it won’t stimulate your appetite. Boiled potatoes are similarly filling due to the high water content. Most people think potatoes are a fattening food, but I think it’s all in how they are prepared. If you fry them in oil and smother them in salt, then absolutely they become junk food: dense in calories and overstimulating to the appetite.</p>



<h2 class="wp-block-heading">Really, Really good days</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="907" height="604" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-281.png" alt="" class="wp-image-1236546" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-281.png 907w, https://blog.finxter.com/wp-content/uploads/2023/03/image-281-300x200.png 300w, https://blog.finxter.com/wp-content/uploads/2023/03/image-281-768x511.png 768w" sizes="auto, (max-width: 907px) 100vw, 907px" /></figure>
</div>


<p>There were 4 days where I was able to eat less than 1400 calories total. What did those days look like?&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="691" height="286" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-268.png" alt="" class="wp-image-1236475" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-268.png 691w, https://blog.finxter.com/wp-content/uploads/2023/03/image-268-300x124.png 300w" sizes="auto, (max-width: 691px) 100vw, 691px" /></figure>
</div>


<p>Honey water???</p>



<p>Basically, that’s just herbal tea sweetened with honey. Apparently, I drank a lot of it on those days. It makes sense to fill up on liquids when trying to lose weight.</p>



<p>And I think sipping on herbal tea also distracted me from the fact that I wasn’t eating as much. Consider that a 12-ounce can of Coke has 39 grams of sugar, whereas a teaspoon of honey has only 5.6 grams. 39 grams of sugar is about 10 teaspoons’ worth!</p>



<p>I couldn’t imagine adding 10 teaspoons of sugar to a mug of tea, or to any drink for that matter. I couldn’t even imagine adding 10 teaspoons of sugar to a bowl of Cheerios. What happens when someone gets used to that much sugar? Healthy foods won’t taste sweet enough anymore.</p>



<h2 class="wp-block-heading">Which foods are most nutritious?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="625" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-282.png" alt="" class="wp-image-1236550" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-282.png 625w, https://blog.finxter.com/wp-content/uploads/2023/03/image-282-200x300.png 200w" sizes="auto, (max-width: 625px) 100vw, 625px" /></figure>
</div>


<p>I created a nutrient score for each food item in my diary by adding up the percentages of the US RDA it provided for vitamins and minerals, then dividing by its number of calories.</p>



<p>The units for each food item are just how much I ate that day. So I’m looking at which foods contributed the most toward meeting my nutrient needs for the fewest calories.</p>
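<p>As a rough sketch of that score in pandas (the nutrient columns and numbers below are hypothetical, not my actual diary data):</p>

```python
import pandas as pd

# Each micronutrient column holds the percent of the US RDA the item provided.
diary = pd.DataFrame({
    'food':          ['spinach', 'sardines', 'boiled potatoes'],
    'calories':      [23, 208, 87],
    'vitamin_a_pct': [188, 2, 0],
    'vitamin_c_pct': [47, 0, 22],
    'iron_pct':      [15, 16, 4],
})

# Sum the percent-RDA columns and divide by calories:
# nutrients delivered per calorie spent.
pct_cols = [c for c in diary.columns if c.endswith('_pct')]
diary['nutrient_score'] = diary[pct_cols].sum(axis=1) / diary['calories']
print(diary[['food', 'nutrient_score']])
```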


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="493" height="345" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-269.png" alt="" class="wp-image-1236476" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-269.png 493w, https://blog.finxter.com/wp-content/uploads/2023/03/image-269-300x210.png 300w" sizes="auto, (max-width: 493px) 100vw, 493px" /></figure>
</div>


<p>Greens for the win! Adding a variety of leafy greens each day is a really good idea. And spinach tastes pretty good as long as it’s fresh, especially baby spinach. Cilantro also adds an interesting flavor.</p>



<h2 class="wp-block-heading">What about sodium?</h2>



<p>Sodium is one nutrient you don’t want to get too much of. Unfortunately, the sodium content in processed foods is very high. There were 30 days where I got more than 150% of the US RDA (Recommended Daily Allowance) of sodium, and 10 days where I got more than 200%!</p>
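<p>Counting threshold days like that is a one-liner once daily sodium totals are in a DataFrame. The numbers below are hypothetical stand-ins:</p>

```python
import pandas as pd

# Hypothetical daily sodium intake, as a percent of the US RDA.
daily = pd.DataFrame({'sodium_pct': [120, 160, 210, 95, 175, 230, 140]})

# Boolean comparisons produce True/False columns; summing counts the Trues.
over_150 = (daily['sodium_pct'] > 150).sum()
over_200 = (daily['sodium_pct'] > 200).sum()
print(f'days over 150%: {over_150}, days over 200%: {over_200}')
```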



<p>What foods did I eat that were highest in sodium?</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="553" height="386" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-270.png" alt="" class="wp-image-1236478" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-270.png 553w, https://blog.finxter.com/wp-content/uploads/2023/03/image-270-300x209.png 300w" sizes="auto, (max-width: 553px) 100vw, 553px" /></figure>
</div>


<p>You&#8217;re killin’ me, Trader Joe!</p>



<p>Basically, all of these are convenience foods that taste pretty good. The cost is too much sodium and too many calories. This brings to mind another Jack LaLanne quote: <em>&#8220;If man makes it, don&#8217;t eat it.&#8221;</em> The good news is that if I don’t eat these foods, I can afford to add some salt to my dinner.</p>



<p>A bit of salt does wonders for the taste of foods like quinoa.</p>



<h2 class="wp-block-heading">Conclusions</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="663" height="938" src="https://blog.finxter.com/wp-content/uploads/2023/03/image-283.png" alt="" class="wp-image-1236553" srcset="https://blog.finxter.com/wp-content/uploads/2023/03/image-283.png 663w, https://blog.finxter.com/wp-content/uploads/2023/03/image-283-212x300.png 212w" sizes="auto, (max-width: 663px) 100vw, 663px" /></figure>
</div>


<p>I was able to create an extremely accurate real-world regression model with only one feature.</p>



<p>All I need is the starting date and the number of days into the weight-loss regimen, and I can predict how much weight I lost with a high degree of accuracy. An R-squared of 0.98 is pretty darn good! The only caveat is that the model is only accurate after about 3 weeks.</p>
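<p>A one-feature linear fit like that can be sketched with NumPy. This isn’t my actual code, and the weigh-in numbers below are synthetic, but the mechanics are the same:</p>

```python
import numpy as np

# Synthetic daily weigh-ins: a steady downward trend plus noise.
rng = np.random.default_rng(42)
days = np.arange(90)
weight = 200 - 0.25 * days + rng.normal(0, 0.5, size=days.size)

# Fit weight as a linear function of days elapsed.
slope, intercept = np.polyfit(days, weight, deg=1)
predicted = slope * days + intercept

# R-squared: the fraction of variance the fit explains.
ss_res = ((weight - predicted) ** 2).sum()
ss_tot = ((weight - weight.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(f'slope: {slope:.3f} lb/day, R^2: {r_squared:.3f}')
```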



<p>I also learned a lot from analyzing the data after the fact. I was surprised at the number of times I actually failed to meet my daily targets. Yet the encouraging thing is that it doesn’t matter! As long as I succeed more often than I fail, and my successes are greater than my failures, the plan will work. There is no need to try to hit an exact calorie amount each and every day.</p>



<p>I also learned a good bit about foods that work for me versus the ones that don’t. The key is to process your own food. If you allow Coca-Cola and Nabisco to do it for you, they will pack in the calories and make the food over-palatable, encouraging you to overeat. It comes down to learning to appreciate the subtle taste of healthy food versus the overwhelming taste of junk food. What makes food taste better? Salt, sugar, and fat. You want to be the one controlling the amounts. If you know how to cook, there is also texture, presentation, herbs and spices, etc. Guess I need to learn to cook!</p>



<p>As a final note, it’s fascinating how well the conclusions I’ve drawn from the data match ancient wisdom. Here’s an example from way back in the mid-1900s:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Jack LaLanne - Talks about the best meal plan...." width="937" height="703" src="https://www.youtube.com/embed/JDq-9K9XmSQ?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
<p>The post <a href="https://blog.finxter.com/how-i-scraped-my-nutrition-data-from-cronometer-com/">How I Scattered My Fat with Python &#8211; Scraping and Analyzing My Nutrition Data From Cronometer.com</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Create a DataFrame From Lists?</title>
		<link>https://blog.finxter.com/how-to-create-a-dataframe-from-lists/</link>
		
		<dc:creator><![CDATA[Charles Blue]]></dc:creator>
		<pubDate>Sat, 17 Dec 2022 08:39:56 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Structures]]></category>
		<category><![CDATA[Pandas Library]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python List]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=985131</guid>

					<description><![CDATA[<p>Pandas is a great library for data analysis in Python. With Pandas, you can create visualizations, filter rows or columns, add new columns, and save the data in a wide range of formats. The workhorse of Pandas is the DataFrame. 👉 Recommended: 10 Minutes to Pandas (in 5 Minutes) So the first step working with ... <a title="How to Create a DataFrame From Lists?" class="read-more" href="https://blog.finxter.com/how-to-create-a-dataframe-from-lists/" aria-label="Read more about How to Create a DataFrame From Lists?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-create-a-dataframe-from-lists/">How to Create a DataFrame From Lists?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Pandas is a great library for data analysis in Python. With Pandas, you can create visualizations, filter rows or columns, add new columns, and save the data in a wide range of formats. The workhorse of Pandas is the <strong>DataFrame</strong>. </p>



<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">10 Minutes to Pandas (in 5 Minutes)</a></p>



<p>So the first step working with Pandas is often to get our data into a DataFrame. If we have data stored in <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">lists</a>, how can we create this all-powerful DataFrame? </p>



<p>There are 4 basic strategies:</p>



<ol class="wp-block-list" type="1">
<li>Create a <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> with column names as keys and your lists as values. Pass this dictionary as an argument when creating the DataFrame.</li>



<li>Pass your lists into the <code><a href="https://blog.finxter.com/python-ziiiiiiip-a-helpful-guide/" data-type="post" data-id="1938" target="_blank" rel="noreferrer noopener">zip()</a></code> function. As with strategy 1, your lists will become columns in the DataFrame.</li>



<li>Put your lists into a list instead of a dictionary. In this case, your lists become rows instead of columns.</li>



<li><a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" data-type="post" data-id="16764" target="_blank" rel="noreferrer noopener">Create an empty DataFrame</a> and add columns one by one.</li>
</ol>



<h2 class="wp-block-heading">Method 1: Create a DataFrame using a Dictionary</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="1010" height="645" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-237.png" alt="" class="wp-image-985155" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-237.png 1010w, https://blog.finxter.com/wp-content/uploads/2022/12/image-237-300x192.png 300w, https://blog.finxter.com/wp-content/uploads/2022/12/image-237-768x490.png 768w" sizes="auto, (max-width: 1010px) 100vw, 1010px" /></figure>
</div>


<p>The first step is to import pandas. If you haven&#8217;t already, <a href="https://blog.finxter.com/how-to-install-pandas-in-python/" data-type="post" data-id="35926" target="_blank" rel="noreferrer noopener">install pandas</a> first.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd</pre>



<p>Let&#8217;s say you have employee data stored as lists.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># if your data is stored like this
employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]
bonus = [1000, 500, 2500, 400]
tax_rate = [.1, .25, .17, .4]
absences = [0, 1, 0, 52]
</pre>



<p>Build a dictionary using column names as keys and your lists as values.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># you can easily create a dictionary that will define your dataframe
emp_data = {
    'name': employee,
    'salary': salary,
    'bonus': bonus,
    'tax_rate': tax_rate,
    'absences': absences
}

# pass the dictionary to the DataFrame constructor
emp_df = pd.DataFrame(emp_data)
emp_df
</pre>



<p>Your lists will become columns in the resulting DataFrame.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="367" height="164" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-230.png" alt="" class="wp-image-985144" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-230.png 367w, https://blog.finxter.com/wp-content/uploads/2022/12/image-230-300x134.png 300w" sizes="auto, (max-width: 367px) 100vw, 367px" /></figure>
</div>


<h2 class="wp-block-heading">Create a DataFrame using the zip function</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="1010" height="668" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-238.png" alt="" class="wp-image-985156" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-238.png 1010w, https://blog.finxter.com/wp-content/uploads/2022/12/image-238-300x198.png 300w, https://blog.finxter.com/wp-content/uploads/2022/12/image-238-768x508.png 768w" sizes="auto, (max-width: 1010px) 100vw, 1010px" /></figure>
</div>


<p>Pass each list as a separate argument to the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-ziiiiiiip-a-helpful-guide/" data-type="post" data-id="1938" target="_blank">zip()</a></code> function. You can specify the column names using the <code>columns</code> parameter or by setting the <code>columns</code> property on a separate line.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df = pd.DataFrame(zip(employee, salary, bonus, tax_rate, absences))
emp_df.columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']
</pre>
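<p>For example, passing the <code>columns</code> parameter sets the names in the same call (two of the lists are re-declared here so the snippet stands alone):</p>

```python
import pandas as pd

employee = ['Betty', 'Veronica', 'Archie', 'Jughead']
salary = [110_000, 20_000, 80_000, 70_000]

# zip pairs up the lists row by row; columns= names the columns in one step.
emp_df = pd.DataFrame(zip(employee, salary), columns=['name', 'salary'])
print(emp_df)
```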



<p>The <code>zip()</code> function creates an <a href="https://blog.finxter.com/iterators-iterables-and-itertools/" data-type="post" data-id="29507" target="_blank" rel="noreferrer noopener">iterator</a>. For the first iteration, it grabs every value at index 0 from each list. This becomes the first row in the DataFrame. Next, it grabs every value at index 1 and this becomes the second row. This continues until it exhausts the shortest list.</p>



<p>We can loop through the iterator to see how this works.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">i = 0
for value in zip(employee, salary, bonus, tax_rate, absences):
  print(f'zipped value at index {i}: {value}')
  i += 1
</pre>



<p>Each of these values becomes a row in the DataFrame:</p>



<pre class="wp-block-preformatted"><code>zipped value at index 0: ('Betty', 110000, 1000, 0.1, 0)
zipped value at index 1: ('Veronica', 20000, 500, 0.25, 1)
zipped value at index 2: ('Archie', 80000, 2500, 0.17, 0)
zipped value at index 3: ('Jughead', 70000, 400, 0.4, 52)</code>
</pre>



<h2 class="wp-block-heading">Create a DataFrame using a list of lists</h2>



<p>What if you have a separate list for each employee? In this case, we can just create a <a href="https://blog.finxter.com/python-list-of-lists/" data-type="post" data-id="7890" target="_blank" rel="noreferrer noopener">list of lists</a>. Each of the inner lists becomes a row in the DataFrame.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># lists for employees instead of features
betty = ['Betty', 110000, 1000, 0.1, 0]
veronica = ['Veronica', 20000, 500, 0.25, 1]
archie = ['Archie', 80000, 2500, 0.17, 0]
jughead = ['Jughead', 70000, 400, 0.4, 52]

emp_df = pd.DataFrame([betty, veronica, archie, jughead])
emp_df.columns = ['name', 'salary', 'bonus', 'tax_rate', 'absences']
emp_df
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="380" height="158" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-231.png" alt="" class="wp-image-985145" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-231.png 380w, https://blog.finxter.com/wp-content/uploads/2022/12/image-231-300x125.png 300w" sizes="auto, (max-width: 380px) 100vw, 380px" /></figure>
</div>


<h2 class="wp-block-heading">Create a DataFrame using a list of dictionaries</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="856" height="863" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-239.png" alt="" class="wp-image-985157" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-239.png 856w, https://blog.finxter.com/wp-content/uploads/2022/12/image-239-298x300.png 298w, https://blog.finxter.com/wp-content/uploads/2022/12/image-239-150x150.png 150w, https://blog.finxter.com/wp-content/uploads/2022/12/image-239-768x774.png 768w" sizes="auto, (max-width: 856px) 100vw, 856px" /></figure>
</div>


<p>If the employee data is stored in dictionaries instead of lists, we use a list of dictionaries.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">betty = {'name': 'Betty', 'salary': 110000, 'bonus': 1000, 
         'tax_rate': 0.1, 'absences': 0}

veronica = {'name': 'Veronica', 'salary': 20000, 'bonus': 500, 
            'tax_rate': 0.25, 'absences': 1}

archie = {'name': 'Archie', 'salary': 80000, 'bonus': 2500, 
          'tax_rate': 0.17, 'absences': 0}
          
jughead = {'name': 'Jughead', 'salary': 70000, 'bonus': 400, 
           'tax_rate': 0.4, 'absences': 52}

pd.DataFrame([betty, veronica, archie, jughead])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="374" height="159" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-232.png" alt="" class="wp-image-985146" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-232.png 374w, https://blog.finxter.com/wp-content/uploads/2022/12/image-232-300x128.png 300w" sizes="auto, (max-width: 374px) 100vw, 374px" /></figure>
</div>


<p>The columns are determined by the keys in the dictionaries. What if the dictionaries don’t all have the same keys?</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">betty = {'name': 'Betty', 'salary': 110000, 'bonus': 1000, 
         'tax_rate': 0.1, 'absences': 0, 'hire_date': '2001-01-01'}

veronica = {'name': 'Veronica', 'salary': 20000, 'bonus': 500, 
            'tax_rate': 0.25, 'absences': 1}

archie = {'name': 'Archie', 'salary': 80000, 'bonus': 2500, 
          'tax_rate': 0.17, 'absences': 0, 'title': 'Vice Chief Leader'}
          
jughead = {'name': 'Jughead', 'salary': 70000, 'bonus': 400,      
           'tax_rate': 0.4, 'absences': 52, 'rank': 'yes'}

pd.DataFrame([betty, veronica, archie, jughead])
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="624" height="151" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-233.png" alt="" class="wp-image-985147" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-233.png 624w, https://blog.finxter.com/wp-content/uploads/2022/12/image-233-300x73.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></figure>
</div>


<p>All of the keys will be used. Anytime pandas encounters a dictionary with a missing key, the missing value is replaced with <code>NaN</code>, which stands for “Not a Number”.</p>



<h2 class="wp-block-heading">Create an empty DataFrame and add columns one by one</h2>



<p>This method might be preferable if you need to create a lot of new calculated columns. Here we create a new column for after-tax income.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df = pd.DataFrame()
emp_df['name'] = employee
emp_df['salary'] = salary
emp_df['bonus'] = bonus
emp_df['tax_rate'] = tax_rate
emp_df['absences'] = absences

income = emp_df['salary'] + emp_df['bonus']
emp_df['after_tax'] = income * (1 - emp_df['tax_rate'])
</pre>



<h2 class="wp-block-heading">How to add a list to an existing DataFrame</h2>



<p>Here is a neat trick. If you want to edit a row in a DataFrame, you can use the handy <code><a href="https://blog.finxter.com/slicing-data-from-a-pandas-dataframe-using-loc-and-iloc/" data-type="post" data-id="230997" target="_blank" rel="noreferrer noopener">loc</a></code> indexer. <code>loc</code> allows you to access rows and columns by their index values.</p>



<p>To access a row:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df.loc[3]</pre>



<p>Output is the row with index value 3 as a Series:</p>



<pre class="wp-block-preformatted"><code>name        Jughead
salary        70000
bonus           400
tax_rate        0.4
absences         52
Name: 3, dtype: object</code>
</pre>



<p>To access a column, just pass in the column name as the column index. Note that we have to specify both the row and column indexes. The format is <code>[rows, columns]</code>. If you want all rows, you can use “<code>:</code>” as we do here. The <code>:</code> also works if you want all columns.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df.loc[:, 'salary']</pre>



<p>The output is also a Series:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0    110000
1     20000
2     80000
3     70000
Name: salary, dtype: int64
</pre>



<p>So how do we use <code>loc</code> to add a new row? If we use a row index that doesn’t exist in the DataFrame, it will create a new row for us.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">new_emp = ['Fonzie', 200000, 30000, .05, 112]
emp_df.loc[4] = new_emp
emp_df
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="366" height="183" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-234.png" alt="" class="wp-image-985148" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-234.png 366w, https://blog.finxter.com/wp-content/uploads/2022/12/image-234-300x150.png 300w" sizes="auto, (max-width: 366px) 100vw, 366px" /></figure>
</div>


<p>You can also update existing data with <code>loc</code>. Let’s lower Fonzie’s salary. It looks a bit excessive.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">emp_df.loc[4, 'salary'] = 105000
emp_df
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="376" height="183" src="https://blog.finxter.com/wp-content/uploads/2022/12/image-235.png" alt="" class="wp-image-985149" srcset="https://blog.finxter.com/wp-content/uploads/2022/12/image-235.png 376w, https://blog.finxter.com/wp-content/uploads/2022/12/image-235-300x146.png 300w" sizes="auto, (max-width: 376px) 100vw, 376px" /></figure>
</div>


<p>That’s more like it.</p>



<h2 class="wp-block-heading"><strong>Conclusion</strong></h2>



<p>There are many different ways of creating a DataFrame. We looked at several methods using data stored in lists. Each will get the job done. </p>



<p>The most convenient method will depend on what your lists represent. </p>



<p>If each of your lists would best be represented as a column, then a dictionary of lists might be the easiest way to go. </p>



<p>If each of your lists would best be represented as a row, then a list of lists would be a good choice. </p>



<p>To add data in a list as a new row to an existing DataFrame, the <code>loc</code> indexer comes in handy. It is also useful for updating existing data.</p>
<p>The post <a href="https://blog.finxter.com/how-to-create-a-dataframe-from-lists/">How to Create a DataFrame From Lists?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
