How I Scattered My Fat with Python - Scraping and Analyzing My Nutrition Data From Cronometer.com

From April 1st through August 14th, I tracked everything I ate on cronometer.com as part of a weight loss challenge. Overall I lost almost 25 pounds at a rate of 1.2 pounds per week.

I always wondered what I could learn if I could scrape that data and get it into a Jupyter Notebook. In this article, I will analyze the data and hopefully demonstrate the value of scraping and analyzing personal data.

Why cronometer.com is useful for tracking dietary information

Cronometer allows you to track your foods, biometric data, exercise, and notes. It will calculate calories and a whole host of nutritional information related to vitamins, minerals, macronutrients, amino acids, etc. It will even allow you to track important nutrient ratios such as Omega-6 to Omega-3, Potassium to Sodium, and Calcium to Magnesium.

Here is a sample of the diary page:

A handy summary of calories consumed, burned and remaining

Calories burned are based on your Basal Metabolic Rate, an estimate of calories burned based on your average daily activity level and the exercise you entered. On this day, I had 387 calories remaining, which means I had a calorie deficit of 387, which is a good day if you’re trying to lose weight. 💪

The diary also displays a great deal of nutrient information, including vitamins, minerals, protein including amino acids, carbohydrates, and fats.

It shows the overall nutrition information for the day as a whole, and for each item in the food diary. Much information is just sitting there, waiting to be harvested.

Tools used to scrape the data

To scrape data off of an interactive site like cronometer, you need a tool that will automate interacting with the site.

The tool I used for automation was Selenium.

💡 Selenium was great for logging in, navigating the calendar to move from day to day, and right-clicking items in the food diary to get to the detailed information. However, I used the read_html() function from the Pandas module to extract the data from the web page.

Pandas was also the main tool for the data analysis with some graphs in Seaborn. The full code can be found on the GitHub page here.

Working with Selenium

First the imports.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

That seems like a lot of imports, but they are all necessary. The central object is the web driver. It opens a browser of your choice and automates it, so you can watch the browser while the code is running and inspect it afterward. I chose Firefox because I found it the easiest to work with.

URL = 'https://cronometer.com/login/'

def get_driver(url):
    driver = webdriver.Firefox()
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(5)
    set_viewport_size(driver, 1920, 3200)
    
    return driver

Let’s look at driver.implicitly_wait(5).

The implicitly_wait method sets a default amount of time for the driver to keep looking for an element before throwing a NoSuchElementException.

Modern websites rely on code that runs before all the elements are loaded. If your Selenium code gets ahead of the code behind the web page, you can get hit with the NoSuchElementException. This default waiting time helps avoid that problem. However, there will also be times when we want to use explicit waits.

Now a few words about the set_viewport_size function, but first I will take a deep breath and spend a few moments in my happy place.

The viewport refers to the visible area of the web page in your browser. So if you try to interact with an element that is not in the viewport, you will get an error.

My first attempt to resolve this was to scroll to each element then move to the element before trying to interact with it. And this worked, most of the time. But it would occasionally error on different elements each time. Very frustrating!

But eventually, I discovered that you can set the size of the viewport. By setting the size large enough, the problem was resolved.

def set_viewport_size(driver, width, height):
    window_size = driver.execute_script("""
        return [window.outerWidth - window.innerWidth + arguments[0],
          window.outerHeight - window.innerHeight + arguments[1]];
        """, width, height)
    driver.set_window_size(*window_size)


set_viewport_size(driver, 1920, 3200)

Notice that with driver.execute_script we can run JavaScript in the browser. This can be very useful.
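The arithmetic inside that script can be sketched in plain Python. This is only an illustration of the calculation, not part of the scraper: the function name window_size_for_viewport and the pixel measurements are made up for the example.

```python
def window_size_for_viewport(outer, inner, target):
    """Return the window (width, height) that yields the target viewport.

    outer, inner and target are (width, height) tuples in pixels.
    """
    chrome_w = outer[0] - inner[0]  # horizontal space taken by borders and scrollbars
    chrome_h = outer[1] - inner[1]  # vertical space taken by toolbars, tabs, etc.
    return (chrome_w + target[0], chrome_h + target[1])


# Hypothetical measurements: a 1280x800 window whose page area is 1268x689
print(window_size_for_viewport((1280, 800), (1268, 689), (1920, 3200)))
```

The browser chrome takes up the difference between the outer and inner sizes, so the window has to be that much larger than the viewport you are asking for.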

Logging in to a web site with Selenium

def log_in(driver):
    user_name = driver.find_element(By.NAME, 'username')
    password = driver.find_element(By.NAME, 'password')
    login = driver.find_element(By.ID, 'login-button')


    user_name.send_keys('email@email.com')
    password.send_keys('***********************')
    login.click()


    # go to the diary page
    click_button(driver, DIARY_XPATH)

The By object is used to tell the driver how to find the element you want.

If you are lucky, the element can be uniquely defined by a name or an id as in this case. Filling in a form element is easy. You can just use the element.send_keys method.

Getting to the diary page after logging in was a bit more complicated because I found the need to use an explicit wait to make extra sure the Diary element is there before trying to click it.

DIARY_XPATH = '//span[contains(text(), "Diary")]'


def click_button(driver, button_xpath):
    try:
        button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, button_xpath)))
    except Exception as e:
        print('error trying to click button', button_xpath)
        print(e)
        # without a button there is nothing to click, so re-raise
        raise


    # move to the element first, then click it
    ActionChains(driver).move_to_element(button).click(button).perform()

The ActionChains object allows you to chain multiple actions to an element in one statement. In this case, I move to the element before clicking it.

What is an XPATH? It’s a web scraper’s best friend and worst nightmare. From ChatGPT:

XPath is a query language used to traverse XML and HTML documents. In Selenium, XPath can be used to identify elements on a webpage by navigating the document’s hierarchy of nodes.

XPath is based on a set of rules for traversing the document tree. The tree consists of nodes, which can be either elements, attributes, text, or comments. XPath expressions are used to select nodes or sets of nodes in the tree, based on their relationship to other nodes.

In our example //span[contains(text(), 'Diary')] can be unpacked:

//span returns all span elements in the document, regardless of location
Brackets are used to filter elements
The text() function returns the text associated with the element
The contains(text, look for this) function means look for this anywhere within the text
Putting it all together, //span[contains(text(), 'Diary')] means give me all span elements that have ‘Diary’ anywhere within their text. Luckily, in this case, there is only one such element.
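To see the same filtering logic outside the browser, here is a minimal sketch using only the standard library. The page fragment is made up, and because ElementTree does not support contains(), the bracket filter is written as a Python list comprehension instead. Treat it as an approximation of the XPATH, not the real thing.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the real cronometer markup is more complex
PAGE = """
<html><body>
  <nav>
    <span>Dashboard</span>
    <span>Diary</span>
    <span>Trends</span>
  </nav>
  <div>Diary</div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# './/span' plays the role of '//span': every span at any depth
spans = root.findall(".//span")

# ElementTree has no contains(), so the bracket filter is written in Python
matches = [s for s in spans if "Diary" in (s.text or "")]

print(len(spans), [s.text for s in matches])
```

Note that the div containing "Diary" is ignored because the expression only looks at span elements.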

So in our example, the XPATH is pretty short and identifies only the desired element. So how can XPATH become a nightmare? When I tried to create an XPATH to identify only the vitamin elements on a page.

Here the XPATH quickly becomes complicated. And I was able to find an expression that effectively filtered only the vitamins for one particular record. However, after running the web scraping process, which takes quite a long time, I found a few records where the data was just wrong.

If you right-click on the web page and choose to inspect, it will bring up the developer tools window. Then you can hit control-f to bring up a search box. This is how you can test your XPATH to see what it returns.

For example:

Here I am searching for all HTML elements in the DOM.

Why do I get back 5 elements, shouldn’t there be just one? It turns out there are entire HTML documents embedded within the DOM. And their data doesn’t necessarily match what you see on the screen.

And sometimes the XPATH expression was pulling data that didn’t match what was displayed. This means the data was wrong.

Often these documents were embedded within iFrame elements.

I tried filtering out the iFrames, but nothing I did worked 100% of the time. So how did I end up scraping the actual data? With my old friend Pandas.

Scraping data with Pandas

Pandas has a read_html function that is very powerful and simple to use. All you have to do is feed it driver.page_source and it returns a list of DataFrames. This is very convenient because DataFrames are what I used for data cleaning and data analysis.

The read_html() function searches the page for tables and returns one DataFrame per table. Fortunately, all the data I need is stored in tables.

For example, on the diary page, the daily US RDA data is stored in 6 tables under the headers:

General,
Carbohydrates,
Lipids,
Protein,
Vitamins and
Minerals.

First step is to get the list of DataFrames:

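from io import StringIO
import pandas as pd

# wrap the HTML in StringIO; newer pandas versions deprecate passing a raw HTML string
tables = pd.read_html(StringIO(driver.page_source))
print(f'{len(tables)} tables found')
print('shapes: ', end='')
for table in tables:
    print(table.shape, end=' ')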

Output:

10 tables found
shapes: (26, 8) (5, 4) (6, 4) (9, 4) (13, 4) (13, 4) (11, 4) (10, 7) (1, 5) (7, 7)

The data we want is in the six tables at indices 1 through 6. So we just need to concatenate those tables and filter out the data we don’t want.

nutrients = pd.concat(tables[1:7])  # the slice end is exclusive, so 1:7 covers indices 1 to 6
nutrients.columns = ['item', 'quantity', 'units', 'percent_rda']
nutrients = nutrients.dropna()
nutrients = nutrients[nutrients.percent_rda.str.contains('%')]
nutrients.head()

By default pd.concat stacks DataFrames vertically. The dropna() method removes rows that have empty values.

The next line uses boolean indexing to filter the nutrients DataFrame to include rows where the value in the percent_rda column contains a %. This filters out nutrients like alcohol where there is no RDA.
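Here is a small sketch of the same cleaning steps on toy data, so you can see which rows each step removes. The two DataFrames and their values are invented for the example; only the column names and the sequence of operations follow the code above. I pass ignore_index=True to get a fresh row index, which the original code does not do.

```python
import pandas as pd

# Two hypothetical tables shaped like the ones read_html returned
vitamins = pd.DataFrame([
    ["Vitamins", None, None, None],    # header row: removed by dropna()
    ["Vitamin C", 95.0, "mg", "106%"],
])
general = pd.DataFrame([
    ["Alcohol", 14.0, "g", "N/A"],     # no RDA: removed by the '%' filter
    ["Water", 2500.0, "g", "68%"],
])

# Stack the tables vertically and renumber the rows 0..n-1
nutrients = pd.concat([vitamins, general], ignore_index=True)
nutrients.columns = ["item", "quantity", "units", "percent_rda"]
nutrients = nutrients.dropna()
nutrients = nutrients[nutrients.percent_rda.str.contains("%")]

print(nutrients["item"].tolist())
```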

Pandas is such a powerful and versatile tool for working with data in Python. So I was delighted to find out it can also scrape data.

However, I would like to find something to handle the automation that is a little simpler to work with than Selenium. It does get the job done; perhaps I just need more experience.

Right-clicking with Selenium

The main diary page has nutrient information for the day as a whole, but you can get nutrient information for each item in the food diary by right-clicking the item and choosing ‘details’ in the pop-up menu.

The first step is to find a way to access the food diary rows directly. For that we return to our old friend/nemesis the XPATH.

FOOD_DIARY_XPATH = "//table[@class='crono-table']//td[@class='diary-time']/parent::tr"

Unpacking the expression:

//table means give me all tables anywhere in the document
[@class='crono-table'] means of those tables only give me the ones whose class attribute is exactly 'crono-table'
//td[@class='diary-time'] means give me td elements that fall anywhere under the tables we got from the previous step, but only if their class attribute is exactly 'diary-time'
/parent::tr means: OK, now let’s go up one level to the parent, but only if it is a tr element.

So we can see the XPATH can pack a great deal of filtering logic into one dense compact statement. It’s a lot like regular expressions in that regard.
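The steps above can be tried on a toy document with the standard library. The markup below is invented and far simpler than the real diary page; only the class names come from the XPATH. ElementTree supports just a subset of XPath, so the last step uses '..' (any parent) where the original uses parent::tr.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; class names come from the XPATH above
PAGE = """
<html><body>
  <table class="crono-table">
    <tr><td class="diary-time">8:00</td><td>Oatmeal</td></tr>
    <tr><td class="group-header">Lunch</td></tr>
    <tr><td class="diary-time">12:30</td><td>Tofu</td></tr>
  </table>
  <table class="other-table">
    <tr><td class="diary-time">9:00</td><td>Ignored</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# table filter, then td filter, then one step up to the enclosing row
rows = root.findall(".//table[@class='crono-table']//td[@class='diary-time']/..")

# Print the second cell of every matching row
print([row[1].text for row in rows])
```

Only the two rows that have a diary-time cell inside the crono-table come back; the header row and the second table are filtered out.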

Likewise, we need an XPATH expression for the details row in the pop-up menu:

VIEW_EDIT_XPATH = "//*[contains(text(), 'View/Edit')]"

Here the asterisk * is a wildcard. So this expression gives us any element that contains the text “View/Edit”.

Here is the code to get all the food diary elements into a list:

wait = WebDriverWait(driver, 20)
diary = []
diary_elements = wait.until(EC.visibility_of_all_elements_located((By.XPATH, FOOD_DIARY_XPATH)))
diary_elements = [wait.until(EC.element_to_be_clickable(e)) for e in diary_elements]

WebDriverWait defines an explicit wait. Explicit means it waits until a specific condition is met rather than for a fixed amount of time.

We told it to wait a maximum of 20 seconds for this condition to be met.

The first condition we look for is that all matching elements are present on the page and visible. If you don’t wait, your code will sometimes get ahead of the page the driver is trying to load, and you will get an error.

With the last line of code, I am using a list comprehension to make sure each diary element is actually clickable before the element is added to the final list. It is possible for an element to be visible but not yet clickable. This will lead to an error when we try to right-click the element.
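If explicit waits feel like magic, it may help to see the idea written out by hand. The sketch below is not Selenium code; wait_until and element_ready are hypothetical names I made up to mimic the pattern: keep checking a condition at short intervals and give up after a timeout.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page element that only "appears" on the third check
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "diary row" if attempts["count"] >= 3 else None


print(wait_until(element_ready, timeout=2.0, poll=0.01))
```

WebDriverWait works along the same lines: it polls the expected condition until it returns something truthy, and raises a timeout error if the time limit runs out first.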

Working with the calendar in cronometer

This was a fun puzzle to solve. How do you get to April 1, 2021 from today using the controls to go back a year and back or forward a month, and then locate the first day of the month on the calendar?

Here is what the calendar looks like:

The first step is to get to the right year and month:

import time
from datetime import datetime

last_year_xpath = "//div[contains(text(), '«')]"
next_month_xpath = "//div[contains(text(), '›')]"
last_month_xpath = "//div[contains(text(), '‹')]"


target_date = datetime.strptime(target_date, '%Y-%m-%d')
today = datetime.today()


last_year_button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, last_year_xpath)))
next_month_button = driver.find_element(By.XPATH, next_month_xpath)
last_month_button = driver.find_element(By.XPATH, last_month_xpath)


for _ in range(today.year - target_date.year):  
    ac = webdriver.ActionChains(driver)
    ac.move_to_element(last_year_button).click(last_year_button).perform()
    time.sleep(2)


if target_date.month > today.month:
    for _ in range(target_date.month - today.month):
        next_month_button.click()
        time.sleep(2)
else:
    for _ in range(today.month - target_date.month):
        last_month_button.click()
        time.sleep(2)
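The click counting in the code above can be checked without a browser. Here is a small sketch of just the arithmetic; calendar_clicks is a hypothetical helper name and the "today" dates are made up.

```python
from datetime import date


def calendar_clicks(today, target):
    """Return (years_back, month_steps) needed to reach the target month.

    month_steps > 0 means "next month" clicks, < 0 means "last month" clicks.
    """
    years_back = today.year - target.year
    month_steps = target.month - today.month
    return years_back, month_steps


# Hypothetical "today" values; the target is the first day of the challenge
print(calendar_clicks(date(2023, 2, 10), date(2021, 4, 1)))
print(calendar_clicks(date(2021, 12, 25), date(2021, 4, 1)))
```

The year button keeps the month unchanged, so after going back the right number of years only the month difference is left to click through.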

Next, find the control for the day.

There are 42 days on the calendar: 3 from the previous month, 31 for the current month, and 8 for the next month.

We want the calendar element with the text “1”, but only the first one. The day controls all have a unique id starting at 100. The problem is the id for the first day of the month can vary.

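# assumed starting values (not shown in the original snippet): ids start at 100,
# so begin one below because the loop increments before looking
first_day_id = 99
first_day_text = ''

while first_day_text != '1':
    first_day_id += 1
    first_day_css = f"td#calendar-date-{first_day_id}"
    first_day_div = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, first_day_css))
    )
    first_day_text = first_day_div.text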


first_day_div.click()

Then after scraping data for day 1, I just have to click the tomorrow button on the calendar and do it again until I finally reach August 15, 2021.

Selenium was a bit frustrating until I got the hang of it. However, once I increased the viewport size and used explicit waits, it got the job done for website automation. The read_html function from Pandas turned out to be a lifesaver for doing the actual scraping of the data.

Data Analysis

Now for the fun part. After spending so much time scraping the data, it’s time to dive into some analysis!

Overall I lost .17 pounds per day with a standard deviation of .91 pounds. This lasted for 131 days for a total of 24.2 pounds lost.

Here is a scatter plot of Weight vs Day of Challenge including a regression line:

Wow, that is surprisingly linear! I always thought weight loss was supposed to be fast initially, then taper off.

The R-Squared value of .98 is very high. R-squared measures how well the regression line fits the data. Values range between 0 and 1.

An R-squared of 0 would indicate the regression line doesn’t fit the data at all. An R-squared of 1 indicates the regression line fits the data perfectly.

Another interpretation is 98% of the variation of weight can be explained by the day on the program. In other words, the plan worked like a charm! Slow and steady wins the race.
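To make the definition concrete, here is a rough sketch that fits a line and computes R-squared by hand, with no libraries. The weights are made up for illustration and are not my data; fit_line and r_squared are names I chose for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """1 - (unexplained variation / total variation)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up weights (not the article's data): a steady decline plus some noise
days = [0, 1, 2, 3, 4, 5, 6, 7]
weights = [200.0, 200.4, 199.5, 199.6, 199.0, 199.3, 198.8, 198.9]

slope, intercept = fit_line(days, weights)
preds = [intercept + slope * d for d in days]
print(f"slope={slope:.2f}, r2={r_squared(weights, preds):.2f}")
```

The closer the points sit to the line, the smaller the unexplained variation and the closer R-squared gets to 1.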

Here is the code for the graph above. I used the LinearRegression class from the sklearn module to create the regression line. Unfortunately, to get LinearRegression to work for simple regression with only one feature we have to reshape the data.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns

def plot_regression(data, feature, target, title):
  # sklearn expects a 2d matrix so we have to reshape pandas series
  # an array of size n is reshaped into a matrix with n rows and 1 column
  y = data[target].values.reshape(-1, 1)
  X = data[feature].values.reshape(-1, 1)
  model = LinearRegression()
  model.fit(X, y)


  # get slope and intercept from model
  slope = model.coef_[0][0]
  intercept = model.intercept_[0]


  # use slope and intercept to create predictions
  weight_pred = intercept + slope * X.reshape(-1)


  # use R2 score to compare predictions to true values
  r2 = r2_score(data[target], weight_pred)


  # plot
  plt.figure(figsize=(12,8))
  sns.scatterplot(x=feature, y=target, data=data)
  plt.plot(X.reshape(-1), weight_pred, linewidth=1, color='r', label=f'y={slope:.2f} * x + {intercept:.1f}')
  plt.legend()  # without this the label on the regression line is never shown


  # add a second row to the title to display R2
  plt.title(title + f'\nr2 = {r2:.2f} ')

The fact that the mean daily weight loss is only .17 pounds with a relatively large standard deviation of .91 pounds leads to some short-term results that can be quite frustrating.

For example, here is a two-week stretch where it felt like nothing was working:

For comparison, here is a two-week stretch where everything seemed easy:

So slow and steady may win the race, but it can often feel like losing. The trick is to have faith in the plan and keep on truckin’.

We can use a histogram to look at the distribution of weekly weight loss amounts:

More good weeks than bad, and the best week dominates the worst week in absolute value: 3.5 pounds lost vs 1.5 pounds gained. There were enough positive results to stay motivated.

What if I repeated this challenge many times? What would the range of values for average weekly weight loss look like?

I can’t very well replicate the experiment 1,000 times, but I can estimate a 95% confidence interval using the bootstrap method.

This uses resampling with replacement to generate hypothetical samples which can be used to create a confidence interval. Because we are resampling with replacement some values can occur more than once in a given sample and others not at all.

This means we can generate samples from our data that are different from each other but still pulled from the same original data.
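Here is a minimal sketch of a percentile bootstrap using only the standard library. It is not the exact code I ran: the weekly numbers are made up, and bootstrap_ci is a hypothetical helper written for this example.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample
        sample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(sample))
    means.sort()
    tail = (1 - level) / 2
    lower = means[round(tail * n_resamples)]
    upper = means[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Made-up weekly losses in pounds (negative = a week with a gain)
weekly_loss = [3.5, 1.0, 0.5, 2.0, -1.5, 1.5, 1.0, 2.5, 0.0, 1.5, 2.0, 1.0]

low, high = bootstrap_ci(weekly_loss)
print(f"95% CI for mean weekly loss: {low:.2f} to {high:.2f} lb")
```

Each resample produces one hypothetical average, and the middle 95% of those averages forms the interval.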

Assuming the factors behind my current data hold, I am 95% confident that if I repeated this experiment, I would lose somewhere between a pound and almost a pound and a half per week.

This also matches my previous experience. In previous weight loss challenges, I lost weight at a little over a pound a week. The fancy bootstrap method just makes it official.

Looking at total calories over time

My daily goal was to hit a caloric deficit of at least 200 calories. Luckily cronometer will help you calculate an estimate of the number of calories you burn on a typical day.

It will estimate your Basal Metabolic Rate and how many calories you burn each day through activity. For me, the total is 2218 calories per day.

If I eat this amount, I should maintain my weight. If I consistently eat less, I should lose weight. 2000 was a good round number to try and hit each day. So how did I do?

I struggled to hit my daily target early in the challenge. This may explain why I didn’t experience more rapid weight loss at the start. Luckily most days were below the break-even point of 2218 calories so I still lost weight.

After day 50, I hit the target most days. This shows I got better at eating fewer calories over time. Overall, the total calories were not consistent at all, but they didn’t need to be. What seems to matter is the long-run average.

In hindsight, 2000 calories is still a good target even though I can’t expect to hit it every day. By setting a mildly ambitious target, I set up a situation where I can fail a little bit and still be OK.

Correlations

We know that days correlate very highly with weight but what about calories? What other interesting correlations might we find?

We can use a correlation heat map to find out. For calories, I added some calculated fields to make it interesting.

yesterday_total_calories – total calories offset one day in the past
total_calories_7dma – average calories for the previous 7 days
total_calories_14dma – average calories for the previous 14 days
total_calories_21dma – average calories for the previous 21 days

The reason for adding the moving averages is to smooth out the day-to-day variation.
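Fields like these can be built in Pandas with shift and rolling. The sketch below uses made-up calorie totals and is one possible way to do it, not necessarily how my notebook does. In particular, I am assuming "previous 7 days" excludes the current day, which is why the series is shifted before the rolling mean.

```python
import pandas as pd

# Made-up daily totals, not the article's data
df = pd.DataFrame(
    {"total_calories": [2100, 1900, 2300, 1800, 2000, 1950, 2050, 1700, 2200]}
)

# Yesterday's intake: move every value down one row
df["yesterday_total_calories"] = df["total_calories"].shift(1)

# Mean of the 7 days before each day: shift first so the current day is excluded
df["total_calories_7dma"] = df["total_calories"].shift(1).rolling(7).mean()

print(df.tail(2))
```

The first rows of the moving average are empty (NaN) until a full 7 days of history are available. The 14 and 21 day versions follow the same pattern with a larger window.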

Here is the code to create the heatmap:

import numpy as np

def correlation_heatmap(df, title):
    corr = df.corr()


    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True


    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))


    # Draw the heatmap with the mask
    sns.heatmap(corr, mask=mask, cmap='BuPu', center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5},
                annot=True)
    plt.title(title)
    plt.show()

As expected, the longer the time frame for the moving average, the higher the correlation between past calorie consumption and the current day’s weight.

Does this mean that what I ate 14 days ago affects my weight today?

I don’t think so. I think a lot of things, such as hydration levels, can affect your weight at any given point in time. But that averages out, in the long run, leaving total calories as the dominating factor determining body weight.

Good Days, Bad Days

You know I’ve had my share. What did I eat on bad days vs good days?

I defined a bad day as any day I had a caloric surplus > 100 calories. It turns out I had 20 bad days, that’s 15% of the days in the challenge. That’s a lot more than I remember.
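As a rough sketch, the labeling rule looks like this. It uses the 100-calorie margin in both directions, matching the definition of a good day used further on in this article. For simplicity it assumes the same 2218 calories burned every day, whereas the real figure varies with exercise; classify_day is a hypothetical helper.

```python
MAINTENANCE = 2218  # estimated calories burned on a typical day, from the article


def classify_day(calories_eaten, burned=MAINTENANCE, margin=100):
    """Label a day by its calorie balance, using a 100-calorie margin."""
    surplus = calories_eaten - burned
    if surplus > margin:
        return "bad"
    if surplus < -margin:
        return "good"
    return "neutral"


# Made-up intakes for illustration
for eaten in (2500, 2218, 1950):
    print(eaten, classify_day(eaten))
```

Days within 100 calories of break-even fall into neither group, which keeps the comparison between good and bad days a little cleaner.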

Damn you sourdough, damn you straight to hell! Why do you have to taste so good? I don’t miss the other foods I’ve given up like frozen pizza, chips, cookies, soda, ice cream etc. But do I have to give up that fluffy slice of heaven known as sourdough bread? Apparently so. They say you can lose weight without giving up the foods you love. They lie. As the immortal Jack LaLanne once said “If it tastes good, spit it out!”

Why would decaf coffee show up on this list? I used a high-calorie creamer and drank extra cups on bad days. And it’s also something I drank pretty much every day.

For comparison, let’s look at the top calorie sources on good days, defined as any day with a calorie deficit > 100 calories.

Boiled potatoes, quinoa, tofu, bananas, and sardines. Doesn’t sound very appetizing does it?

Apparently that’s why they work as weight-loss foods. Oh well, at least I have beer. It is a matter of pride that I could have one beer a day and still lose weight. I really looked forward to that beer every day. The sardines, not so much.

Why does tofu work well as a weight-loss food? It’s high in protein, and it sits in your stomach like a brick. And it won’t stimulate your appetite. Boiled potatoes are similarly filling due to the high water content. Most people think potatoes are a fattening food, but I think it’s all in how they are prepared. If you fry them in oil and smother them in salt, then absolutely they become junk food: dense in calories and overstimulating to the appetite.

Really, Really good days

There were 4 days where I was able to eat less than 1400 calories total. What did those days look like?

Honey water???

Basically, that’s just herbal tea sweetened with honey. Apparently, I drank a lot on those days. Makes sense to fill up on liquids when trying to lose weight.

And I think sipping on herbal tea also distracted me from the fact that I wasn’t eating as much. And consider that a 12 ounce can of Coke has 39 grams of sugar, whereas a teaspoon of honey only has 5.6 grams of sugar. 39 grams of sugar is about 10 teaspoons worth!

I couldn’t imagine adding 10 teaspoons of sugar to a mug of tea, or to any drink for that matter. I couldn’t even imagine adding 10 teaspoons of sugar to a bowl of Cheerios. What happens when someone gets used to that much sugar? Healthy foods won’t taste sweet enough any more.

Which foods are most nutritious?

I created a nutrient score by adding up the percent of US RDA for the vitamins and minerals for each food item in my diary divided by the number of calories.

The quantity for each food item is simply how much I ate that day. So I’m looking at which foods contributed the most to meeting my nutrient needs for the least number of calories.
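Here is a minimal sketch of the score calculation. The foods, percentages and calorie counts are invented for the example, and nutrient_score is a hypothetical helper, not the function from my notebook.

```python
def nutrient_score(percent_rda, calories):
    """Sum of %RDA across nutrients, per calorie eaten."""
    return sum(percent_rda.values()) / calories


# Made-up entries: %RDA per nutrient and calories for the amount eaten
foods = {
    "spinach": ({"vitamin_a": 56, "vitamin_k": 181, "folate": 15}, 7),
    "white bread": ({"folate": 7, "iron": 5, "thiamin": 8}, 80),
}

scores = {name: nutrient_score(rda, kcal) for name, (rda, kcal) in foods.items()}

# Highest score first
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

A food scores high when it delivers a lot of vitamins and minerals for very few calories, which is why low-calorie vegetables tend to come out on top.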

Greens for the win! Adding a variety of leafy greens each day is a really good idea. And spinach tastes pretty good as long as it’s fresh, especially baby spinach. Cilantro also adds an interesting flavor.

What about sodium?

Sodium is one nutrient you don’t want to get too much of. Unfortunately, the sodium content in processed foods is very high. There were 30 days where I got more than 150% of the US RDA (Recommended Daily Allowance) of sodium, and 10 days I got higher than 200%!

What foods did I eat that were highest in sodium?

You’re killin’ me Trader Joe!

Basically all these are convenience foods that taste pretty good. The cost is too much sodium and calories. This brings to mind another Jack LaLanne quote “If man makes it, don’t eat it”. The good news is if I don’t eat these foods, I can afford to add some salt to my dinner.

A bit of salt does wonders for the taste of foods like quinoa.

Conclusions

I was able to create a real-world regression model with only one feature that is extremely accurate.

All I need is the starting date and the number of days into the weight loss regimen and I can predict how much weight I lost with a high degree of accuracy. An R-squared of .98 is pretty darn good! The only caveat is the model is only going to be accurate after about 3 weeks.

I also learned a lot from analyzing the data after the fact. I was surprised at the number of times I actually failed to meet my daily targets. Yet the encouraging thing is it doesn’t matter! As long as I succeed more than fail and my successes are greater than my failures, the plan will work. And there is no need to try and hit an exact calorie amount each and every day.

I also learned a good bit about foods that work for me versus the ones that don’t. The key is to process your own food. If you allow Coca-Cola and Nabisco to do it for you, they will pack in the calories and make the food over-palatable, encouraging you to overeat. The key is learning to appreciate the subtle taste of healthy food vs the overwhelming taste of junk food. What makes food taste better? Salt, sugars, and fat. You want to be the one controlling the amounts. If you know how to cook, there is also texture, presentation, herbs and spices, etc. Guess I need to learn to cook!

As a final note, it’s fascinating how well the conclusions I’ve drawn from the data match ancient wisdom. Here’s an example from way back in the mid 1900s: