From April 1st through August 14th, I tracked everything I ate on cronometer.com as part of a weight loss challenge. Overall I lost almost 25 pounds at a rate of 1.2 pounds per week.
I always wondered what I could learn if I could scrape that data and get it into a Jupyter Notebook. In this article, I will analyze the data and hopefully demonstrate the value of scraping and analyzing personal data.
Why cronometer.com is useful for tracking dietary information
Cronometer allows you to track your foods, biometric data, exercise, and notes. It will calculate calories and a whole host of nutritional information related to vitamins, minerals, macronutrients, amino acids, etc. It will even allow you to track important nutrient ratios such as Omega-6 to Omega-3, Potassium to Sodium, and Calcium to Magnesium.
Here is a sample of the diary page:

A handy summary of calories consumed, burned and remaining

Calories burned are based on your Basal Metabolic Rate, an estimate of calories burned based on your average daily activity level, and the exercise you entered. On this day, I had 387 calories remaining, meaning a calorie deficit of 387, which is a good day if you're trying to lose weight. šŖ
The diary also displays a great deal of nutrient information, including vitamins, minerals, protein including amino acids, carbohydrates, and fats.

It shows the overall nutrition information for the day as a whole, and for each item in the food diary. Much information is just sitting there, waiting to be harvested.
Tools used to scrape the data

To scrape data off of an interactive site like cronometer, you need a tool that will automate interacting with the site.
The tool I used for automation was Selenium.
š” Selenium was great for logging in, navigating the calendar to move from day to day, and right-clicking items in the food diary to get to the detailed information. However, I used the read_html() function from the Pandas module to extract the data from the web page.
Pandas was also the main tool for the data analysis with some graphs in Seaborn. The full code can be found on the GitHub page here.
Working with Selenium
First the imports.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
Seems like a lot of imports, but they are all necessary. The central object is the web driver. It will open a browser of your choice and automate it. The nice thing is that you can watch the browser while the code is running and after it finishes. I chose Firefox for the browser; I just found it to be the easiest to work with.
URL = 'https://cronometer.com/login/'

def get_driver(url):
    driver = webdriver.Firefox()
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(5)
    set_viewport_size(driver, 1920, 3200)
    return driver
Let's look at driver.implicitly_wait(5). The implicitly_wait function is used to set a default time for the driver to wait before throwing a NoSuchElementException.
Modern websites rely on code that runs before all the elements are loaded. If your Selenium code gets ahead of the code behind the web page, you can get hit with the NoSuchElementException. This default waiting time will help avoid the problem. However, there will also be times when we want to use explicit waits.
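An explicit wait, by contrast, blocks until a specific condition is met. We will use them later in the article; as a quick preview, waiting for an element to appear looks roughly like this (the element id below is a made-up placeholder):

# wait up to 10 seconds for the element to be present, then use it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'some-element-id'))  # hypothetical id
)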

Now a few words about the set_viewport_size function, but first I will take a deep breath and spend a few moments in my happy place.
The viewport refers to the visible area of the web page in your browser. So if you try to interact with an element that is not in the viewport, you will get an error.
My first attempt to resolve this was to scroll to each element then move to the element before trying to interact with it. And this worked, most of the time. But it would occasionally error on different elements each time. Very frustrating!
But eventually, I discovered that you can set the size of the viewport. By setting the viewport large enough, I resolved the problem.
def set_viewport_size(driver, width, height):
    window_size = driver.execute_script("""
        return [window.outerWidth - window.innerWidth + arguments[0],
                window.outerHeight - window.innerHeight + arguments[1]];
        """, width, height)
    driver.set_window_size(*window_size)

set_viewport_size(driver, 1920, 3200)
Notice that with driver.execute_script we can run JavaScript in the browser. This can be very useful.
Logging in to a web site with Selenium
def log_in(driver):
    user_name = driver.find_element(By.NAME, 'username')
    password = driver.find_element(By.NAME, 'password')
    login = driver.find_element(By.ID, 'login-button')
    user_name.send_keys('email@email.com')
    password.send_keys('***********************')
    login.click()
    # go to the diary page
    click_button(driver, DIARY_XPATH)
The By object is used to tell the driver how to find the element you want. If you are lucky, the element can be uniquely defined by a name or an id, as in this case. Filling in a form element is easy; you can just use the element.send_keys method.
Clicking the login button was a bit more complicated because I needed an explicit wait to make extra sure the element is there before trying to click it.
DIARY_XPATH = '//span[contains(text(), "Diary")]'

def click_button(driver, button_xpath):
    try:
        button = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, button_xpath))
        )
    except Exception as e:
        print('error trying to click button', button_xpath)
        print(e)
        return  # bail out; the button was never found
    webdriver.ActionChains(driver).move_to_element(button).click(button).perform()
The ActionChains object allows you to chain multiple actions to an element in one statement. In this case, I move to the element before clicking it.

What is an XPATH? It's a web scraper's best friend and worst nightmare. From ChatGPT:
XPath is a query language used to traverse XML and HTML documents. In Selenium, XPath can be used to identify elements on a webpage by navigating the document’s hierarchy of nodes.
XPath is based on a set of rules for traversing the document tree. The tree consists of nodes, which can be either elements, attributes, text, or comments. XPath expressions are used to select nodes or sets of nodes in the tree, based on their relationship to other nodes.
In our example, //span[contains(text(), 'Diary')] can be unpacked:
- //span returns all span elements in the document, regardless of location
- Brackets are used to filter elements
- The text() function returns the text associated with the element
- contains(text(), look for this) means look for this anywhere within the text
- Putting it all together, //span[contains(text(), 'Diary')] means give me all span elements that have 'Diary' anywhere within their text. Luckily, in this case, there is only one such element.
So in our example, the XPATH is pretty short and identifies only the desired element. So how can XPATH become a nightmare? When I tried to create an XPATH to identify only the vitamin elements on a page.

Here the XPATH quickly becomes complicated. And I was able to find an expression that effectively filtered only the vitamins for one particular record. However, after running the web scraping process, which takes quite a long time, I found a few records where the data was just wrong.
If you right-click on the web page and choose to inspect, it will bring up the developer tools window. Then you can hit control-f to bring up a search box. This is how you can test your XPATH to see what it returns.
For example:

Here I am searching for all HTML elements in the DOM.
Why do I get back 5 elements? Shouldn't there be just one? It turns out there are entire HTML documents embedded within the DOM. And their data doesn't necessarily match what you see on the screen.
And sometimes the XPATH expression was pulling data that didn't match what was displayed. This means the data was wrong.
Often these documents were embedded within iFrame elements.

I tried filtering out the iFrames, but nothing I did worked 100% of the time. So how did I end up scraping the actual data? With my old friend Pandas.
Scraping data with Pandas

Pandas has a read_html method that is very powerful and simple to use. All you have to do is feed it driver.page_source and it returns a list of DataFrames. This is very convenient because DataFrames are what I used for data cleaning and data analysis. The read_html() method searches for data in tables and is smart enough to only give you the desired data. Fortunately, all the data I need is stored in tables.
For example, on the diary page, the daily US RDA data is stored in 6 tables under the headers:
- General
- Carbohydrates
- Lipids
- Protein
- Vitamins
- Minerals

First step is to get the list of DataFrames:
import pandas as pd

tables = pd.read_html(driver.page_source)
print(f'{len(tables)} tables found')
print('shapes: ', end='')
for i in range(len(tables)):
    print(tables[i].shape, end=' ')
Output:
10 tables found
shapes: (26, 8) (5, 4) (6, 4) (9, 4) (13, 4) (13, 4) (11, 4) (10, 7) (1, 5) (7, 7)
The data we want is in the tables at indices 1 through 6. So we just need to concatenate those tables and filter out the data we don't want.
nutrients = pd.concat(tables[1:7])  # the six nutrient tables at indices 1 through 6
nutrients.columns = ['item', 'quantity', 'units', 'percent_rda']
nutrients = nutrients.dropna()
nutrients = nutrients[nutrients.percent_rda.str.contains('%')]
nutrients.head()

By default, pd.concat stacks DataFrames vertically. The dropna() method removes rows that have empty values. The next line uses boolean indexing to filter the nutrients DataFrame to include rows where the value in the percent_rda column contains a %. This filters out nutrients like alcohol, where there is no RDA.
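From here it only takes one more step to make the percentages usable for sorting and aggregation. A minimal sketch, assuming the percent_rda values look like '45%':

# strip the '%' sign and convert to a float
nutrients['percent_rda'] = nutrients['percent_rda'].str.rstrip('%').astype(float)

# e.g. look at the nutrients furthest below 100% of the RDA
print(nutrients.sort_values('percent_rda').head())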
Pandas is such a powerful and versatile tool for working with data in Python. So I was delighted to find out it can also scrape data.
However, I would like to find something to handle the automation that is a little simpler to work with than Selenium. It does get the job done; perhaps I just need more experience.
Right-clicking with Selenium

The main diary page has nutrient information for the day as a whole, but you can get nutrient information for each item in the food diary by right-clicking the item and choosing 'details' in the pop-up menu.
The first step is to find a way to access the food diary rows directly. For that, we return to our old friend/nemesis the XPATH.
FOOD_DIARY_XPATH = "//table[@class='crono-table']//td[@class='diary-time']/parent::tr"
Unpacking the expression:
- //table means give me all tables anywhere in the document
- [@class='crono-table'] means of those tables, only give me the ones that contain the class 'crono-table'
- //td[@class='diary-time'] means give me td elements that fall anywhere under the tables we got from the previous step, but only if they contain the class 'diary-time'
- /parent::tr means: OK, now let's go up one level to the parent, but only if it is a tr element
So we can see the XPATH can pack a great deal of filtering logic into one dense, compact statement. It's a lot like regular expressions in that regard. Likewise, we need an XPATH expression for the details row in the pop-up menu:
VIEW_EDIT_XPATH = "//*[contains(text(), 'View/Edit')]"
Here the asterisk * is a wildcard. So this expression gives us any element that contains the text 'View/Edit'.
Here is the code to get all the food diary elements into a list:
wait = WebDriverWait(driver, 20)
diary = []
diary_elements = wait.until(
    EC.visibility_of_all_elements_located((By.XPATH, FOOD_DIARY_XPATH))
)
diary_elements = [wait.until(EC.element_to_be_clickable(e)) for e in diary_elements]
WebDriverWait defines an explicit wait. By explicit, this means it waits until a condition is met. We told it to wait a maximum of 20 seconds for this condition to be met.
The first condition we look for is that all elements can be located by Selenium. If you donāt wait, your code will sometimes get ahead of the page the driver is trying to load, and you will get an error.
With the last line of code, I am using a list comprehension to make sure each diary element is actually clickable before the element is added to the final list. It is possible for an element to be visible but not yet clickable. This will lead to an error when we try to right-click the element.
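The right-click itself is done with an ActionChains context click. Here is a minimal sketch of how one diary row might be opened and closed; it shows one way the loop could look rather than exactly how the full project handles it, and using the Escape key to close the details dialog is an assumption:

for element in diary_elements:
    # right-click the diary row to open the pop-up menu
    ActionChains(driver).context_click(element).perform()
    # wait for the View/Edit entry, then click it to open the details
    view_edit = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, VIEW_EDIT_XPATH))
    )
    view_edit.click()
    # ... scrape the details here, e.g. with pd.read_html(driver.page_source) ...
    # close the dialog (assumption: Escape closes it; the real UI may need a close button)
    ActionChains(driver).send_keys(Keys.ESCAPE).perform()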
Working with the calendar in cronometer

This was a fun puzzle to solve. How do you get to April 1, 2021 from today, using the controls to go back a year or move back and forward a month, and then locate the first day of the month on the calendar?
Here is what the calendar looks like:

The first step is to get to the right year and month:
from datetime import datetime
import time

last_year_xpath = "//div[contains(text(), '«')]"
next_month_xpath = "//div[contains(text(), '›')]"
last_month_xpath = "//div[contains(text(), '‹')]"

target_date = datetime.strptime(target_date, '%Y-%m-%d')
today = datetime.today()

last_year_button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, last_year_xpath))
)
next_month_button = driver.find_element(By.XPATH, next_month_xpath)
last_month_button = driver.find_element(By.XPATH, last_month_xpath)

# click the back-a-year control once for each year between today and the target
for _ in range(today.year - target_date.year):
    ac = webdriver.ActionChains(driver)
    ac.move_to_element(last_year_button).click(last_year_button).perform()
    time.sleep(2)

# then step forward or back by month until the months match
if target_date.month > today.month:
    for _ in range(target_date.month - today.month):
        next_month_button.click()
        time.sleep(2)
else:
    for _ in range(today.month - target_date.month):
        last_month_button.click()
        time.sleep(2)
Next, find the control for the day.
There are 42 day cells on the calendar: in this case, 3 from the previous month, 31 for the current month, and 8 from the next month.
We want the calendar element with the text ā1ā, but only the first one. The day controls all have a unique id starting at 100. The problem is the id for the first day of the month can vary.
# the day cells have ids like calendar-date-100, but the id of the first day
# of the month varies, so probe until we find the cell whose text is '1'
first_day_id = 99
first_day_text = ''
while first_day_text != '1':
    first_day_id += 1
    first_day_css = f"td#calendar-date-{first_day_id}"
    first_day_div = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, first_day_css))
    )
    first_day_text = first_day_div.text
first_day_div.click()
Then, after scraping data for day 1, I just have to click the tomorrow button on the calendar and do it again until I finally reach August 15, 2021.
Selenium was a bit frustrating until I got the hang of it. However, once I increased the viewport size and used explicit waits, it got the job done for website automation. The read_html function from Pandas turned out to be a lifesaver for doing the actual scraping of the data.
Data Analysis

Now for the fun part. After spending so much time scraping the data, it’s time to dive into some analysis!
Overall I lost .17 pounds per day with a standard deviation of .91 pounds. This lasted for 131 days for a total of 24.2 pounds lost.
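Those summary numbers fall out of the scraped weight series with a couple of Pandas calls. A minimal sketch, assuming a DataFrame named data with one row per day and a weight column (the names are my assumption, not necessarily what the repository uses):

# day-over-day change in weight
daily_change = data['weight'].diff().dropna()

print(f'mean daily change: {daily_change.mean():.2f} lbs')
print(f'std of daily change: {daily_change.std():.2f} lbs')
print(f'days tracked: {len(data)}')
print(f'total change: {data["weight"].iloc[-1] - data["weight"].iloc[0]:.1f} lbs')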
Here is a scatter plot of Weight vs Day of Challenge including a regression line:

Wow, that is surprisingly linear! I always thought weight loss was supposed to be fast initially, then taper off.
The R-Squared value of .98 is very high. R-squared measures how well the regression line fits the data. Values range between 0 and 1.
An R-squared of 0 would indicate the regression line doesn't fit the data at all. An R-squared of 1 indicates the regression line fits the data perfectly.
Another interpretation is 98% of the variation of weight can be explained by the day on the program. In other words, the plan worked like a charm! Slow and steady wins the race.
Here is the code for the graph above. I used the LinearRegression class from the sklearn module to create the regression line. Unfortunately, to get LinearRegression to work for simple regression with only one feature, we have to reshape the data.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns

def plot_regression(data, feature, target, title):
    # sklearn expects a 2d matrix so we have to reshape pandas series
    # an array of size n is reshaped into a matrix with n rows and 1 column
    y = data[target].values.reshape(-1, 1)
    X = data[feature].values.reshape(-1, 1)
    model = LinearRegression()
    model.fit(X, y)

    # get slope and intercept from model
    slope = model.coef_[0][0]
    intercept = model.intercept_[0]

    # use slope and intercept to create predictions
    weight_pred = intercept + slope * X.reshape(-1)

    # use R2 score to compare predictions to true values
    r2 = r2_score(data[target], weight_pred)

    # plot
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x=feature, y=target, data=data)
    plt.plot(X.reshape(-1), weight_pred, linewidth=1, color='r',
             label=f'y={slope:.2f} * x + {intercept:.1f}')

    # add a second row to the title to display R2
    plt.title(title + f'\nr2 = {r2:.2f} ')
The fact that the mean daily weight loss is only .17 pounds, with a relatively large standard deviation of .91 pounds, leads to some short-term results that can be quite frustrating.
For example, here is a two-week stretch where it felt like nothing was working:

For comparison, here is a two-week stretch where everything seemed easy:

So slow and steady may win the race, but it can often feel like losing. The trick is to have faith in the plan and keep on truckin’.
We can use a histogram to look at the distribution of weekly weight loss amounts:

More good weeks than bad, and the best week dominates the worst week in absolute value: 3.5 pounds lost vs 1.5 pounds gained. There were enough positive results to stay motivated.
What if I repeated this challenge many times? What would the range of values for average weekly weight loss look like?
I can't very well replicate the experiment 1,000 times, but I can estimate a 95% confidence interval using the bootstrap method.
This uses resampling with replacement to generate hypothetical samples which can be used to create a confidence interval. Because we are resampling with replacement some values can occur more than once in a given sample and others not at all.
This means we can generate samples from our data that are different from each other but still pulled from the same original data.
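Here is a minimal sketch of how that bootstrap could be computed with NumPy, assuming an array of weekly weight-loss amounts called weekly_loss (the name is mine):

import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# resample the weekly losses with replacement and record each sample's mean
boot_means = [
    rng.choice(weekly_loss, size=len(weekly_loss), replace=True).mean()
    for _ in range(n_samples)
]

# the 2.5th and 97.5th percentiles give a 95% confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f'95% CI for average weekly loss: {low:.2f} to {high:.2f} lbs')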

Assuming the factors behind my current data hold, I can be roughly 95% confident that if I replicated this experiment, I would lose somewhere between one pound and almost a pound and a half per week.
This also matches my previous experience. In previous weight loss challenges, I lost weight at a little over a pound a week. The fancy bootstrap method just makes it official.
Looking at total calories over time

My daily goal was to hit a caloric deficit of at least 200 calories. Luckily cronometer will help you calculate an estimate of the number of calories you burn on a typical day.
It estimates your Basal Metabolic Rate and how many calories you burn each day through activity. For me, the total comes to 2218 calories per day.
If I eat this amount, I should maintain my weight. If I consistently eat less, I should lose weight. 2000 was a good round number to try and hit each day. So how did I do?
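A chart like the one below can be produced by plotting the daily calorie totals against those two reference lines. A rough sketch, assuming the scraped data lives in a DataFrame data with day and total_calories columns (my naming):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.lineplot(x='day', y='total_calories', data=data)

# reference lines: the 2000-calorie target and the 2218-calorie break-even point
plt.axhline(2000, color='g', linestyle='--', label='target (2000)')
plt.axhline(2218, color='r', linestyle='--', label='break-even (2218)')

plt.legend()
plt.title('Total calories per day')
plt.show()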

I struggled to hit my daily target early in the challenge. This may explain why I didn't experience more rapid weight loss at the start. Luckily, most days were below the break-even point of 2218 calories, so I still lost weight.
After day 50, I hit the target most days. This shows I got better at eating fewer calories over time. Overall, the total calories were not consistent at all, but they didn't need to be. What seems to matter is the long-run average.
In hindsight, 2000 calories is still a good target even though I can't expect to hit it every day. By setting a mildly ambitious target, I set up a situation where I can fail a little bit and still be OK.
Correlations

We know that the day of the challenge correlates very highly with weight, but what about calories? What other interesting correlations might we find?
We can use a correlation heat map to find out. For calories, I added some calculated fields to make it interesting.
- yesterday_total_calories – total calories offset one day in the past
- total_calories_7dma – average calories for the previous 7 days
- total_calories_14dma – average calories for the previous 14 days
- total_calories_21dma – average calories for the previous 21 days
The reason for adding the moving averages is to smooth out the day-to-day variation.
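These lagged and smoothed columns can be built with shift() and rolling(). A minimal sketch, again assuming a total_calories column; whether the current day is included in each average is a detail I am guessing at, and here it is excluded:

# yesterday's total calories
data['yesterday_total_calories'] = data['total_calories'].shift(1)

# trailing moving averages over the previous 7, 14, and 21 days
for window in (7, 14, 21):
    data[f'total_calories_{window}dma'] = (
        data['total_calories'].shift(1).rolling(window).mean()
    )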
Here is the code to create the heatmap:
import numpy as np

def correlation_heatmap(df, title):
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Draw the heatmap with the mask
    sns.heatmap(corr, mask=mask, cmap='BuPu', center=0, square=True,
                linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
    plt.title(title)
    plt.show()

As expected, the longer the time frame for the moving average, the higher the correlation between past calorie consumption and the current day’s weight.
Does this mean that what I ate 14 days ago affects my weight today?
I don't think so. I think a lot of things, such as hydration levels, can affect your weight at any given point in time. But that averages out in the long run, leaving total calories as the dominant factor determining body weight.
Good Days, Bad Days

You know I've had my share. What did I eat on bad days vs good days?
I defined a bad day as any day I had a caloric surplus > 100 calories. It turns out I had 20 bad days; that's 15% of the days in the challenge. That's a lot more than I remember.
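Rankings like the ones in the next two charts can be produced by grouping the food diary by item and summing calories on the selected days. A rough sketch, assuming a per-item DataFrame diary with day, item, and calories columns, and a per-day frame days with day and calorie_surplus columns (all names are my assumptions):

# days with a caloric surplus greater than 100 calories
bad_days = days.loc[days['calorie_surplus'] > 100, 'day']

# total calories contributed by each food on those days
top_bad = (
    diary[diary['day'].isin(bad_days)]
    .groupby('item')['calories']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top_bad)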

Damn you sourdough, damn you straight to hell! Why do you have to taste so good? I don't miss the other foods I've given up like frozen pizza, chips, cookies, soda, ice cream, etc. But do I have to give up that fluffy slice of heaven known as sourdough bread? Apparently so. They say you can lose weight without giving up the foods you love. They lie. As the immortal Jack LaLanne once said, "If it tastes good, spit it out!"
Why would decaf coffee show up on this list? I used a high-calorie creamer and drank extra cups on bad days. And it's also something I drank pretty much every day.
For comparison, let's look at the top calorie sources on good days, defined as any day with a calorie deficit > 100 calories.

Boiled potatoes, quinoa, tofu, bananas, and sardines. Doesn't sound very appetizing, does it?
Apparently that's why they work as weight-loss foods. Oh well, at least I have beer. It is a matter of pride that I could have one beer a day and still lose weight. I really looked forward to that beer every day. The sardines, not so much.
Why does tofu work well as a weight-loss food? It's high in protein, and it sits in your stomach like a brick. And it won't stimulate your appetite. Boiled potatoes are similarly filling due to the high water content. Most people think potatoes are a fattening food, but I think it's all in how they are prepared. If you fry them in oil and smother them in salt, then absolutely they become junk food: dense in calories and overstimulating to the appetite.
Really, Really good days

There were 4 days where I was able to eat less than 1400 calories total. What did those days look like?

Honey water???
Basically, that's just herbal tea sweetened with honey. Apparently, I drank a lot of it on those days. Makes sense to fill up on liquids when trying to lose weight.
And I think sipping on herbal tea also distracted me from the fact that I wasn't eating as much. And consider that a 12-ounce can of Coke has 39 grams of sugar, whereas a teaspoon of honey only has 5.6 grams of sugar. At roughly 4 grams per teaspoon, 39 grams of sugar is about 10 teaspoons worth!
I couldn't imagine adding 10 teaspoons of sugar to a mug of tea, or to any drink for that matter. I couldn't even imagine adding 10 teaspoons of sugar to a bowl of Cheerios. What happens when someone gets used to that much sugar? Healthy foods won't taste sweet enough anymore.
Which foods are most nutritious?

I created a nutrient score by adding up the percent of US RDA for the vitamins and minerals for each food item in my diary, divided by the number of calories. The quantity used for each food item is just how much I ate that day. So I'm looking at which foods contributed the most to meeting my nutrient needs for the fewest calories.
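Here is a minimal sketch of how such a score could be computed from the scraped per-item nutrient data, assuming a DataFrame items with one row per food per day, a calories column, and one percent-of-RDA column per vitamin and mineral (all names are my assumptions):

# columns holding percent-of-RDA values for vitamins and minerals
rda_cols = [c for c in items.columns if c.endswith('_pct_rda')]

# nutrient score: total %RDA delivered per calorie
items['nutrient_score'] = items[rda_cols].sum(axis=1) / items['calories']

# rank foods by their average score
print(
    items.groupby('item')['nutrient_score']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)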

Greens for the win! Adding a variety of leafy greens each day is a really good idea. And spinach tastes pretty good as long as it's fresh, especially baby spinach. Cilantro also adds an interesting flavor.
What about sodium?
Sodium is one nutrient you don't want to get too much of. Unfortunately, the sodium content of processed foods is very high. There were 30 days where I got more than 150% of the US RDA (Recommended Dietary Allowance) of sodium, and 10 days where I got more than 200%!
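Counts like these are one-liners once the daily sodium percentage is in a column. A quick sketch, assuming a sodium_pct_rda column in the daily DataFrame (my naming):

# how many days exceeded 150% and 200% of the sodium RDA
print((data['sodium_pct_rda'] > 150).sum(), 'days over 150%')
print((data['sodium_pct_rda'] > 200).sum(), 'days over 200%')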
What foods did I eat that were highest in sodium?

You’re killinā me Trader Joe!
Basically, all of these are convenience foods that taste pretty good. The cost is too much sodium and too many calories. This brings to mind another Jack LaLanne quote: "If man makes it, don't eat it." The good news is that if I don't eat these foods, I can afford to add some salt to my dinner.
A bit of salt does wonders for the taste of foods like quinoa.
Conclusions

I was able to create a real-world regression model that is extremely accurate using only one feature.
All I need is the starting date and the number of days into the weight loss regimen, and I can predict how much weight I lost with a high degree of accuracy. An R-squared of .98 is pretty darn good! The only caveat is the model is only going to be accurate after about 3 weeks.
I also learned a lot from analyzing the data after the fact. I was surprised at the number of times I actually failed to meet my daily targets. Yet the encouraging thing is that it doesn't matter! As long as I succeed more often than I fail, and my successes are bigger than my failures, the plan will work. And there is no need to try and hit an exact calorie amount each and every day.
I also learned a good bit about foods that work for me versus the ones that don't. The key is to process your own food. If you allow Coca-Cola and Nabisco to do it for you, they will pack in the calories and make the food over-palatable, encouraging you to overeat. The key is learning to appreciate the subtle taste of healthy food vs the overwhelming taste of junk food. What makes food taste better? Salt, sugars, and fat. You want to be the one controlling the amounts. If you know how to cook, there is also texture, presentation, herbs and spices, etc. Guess I need to learn to cook!
As a final note, it's fascinating how well the conclusions I've drawn from the data match ancient wisdom. Here's an example from way back in the mid-1900s: