This data science project, conducted at UCSD, explores the relationship between calories and average recipe ratings. The complete analysis and source code are available on GitHub. According to the Food and Drug Administration (FDA), calories are a measure of how much energy you get from one serving of food. Consuming more calories than you use on a daily basis is linked to overweight and obesity. The dataset used for this project contains recipes and ratings from food.com posted since 2008. This dataset was originally scraped and used in the paper, Generating Personalized Recipes from Historical User Preferences, by Majumder et al.
| Column | Description |
|---|---|
| name | Recipe name |
| id | Recipe ID |
| minutes | Minutes to prepare recipe |
| contributor_id | User ID who submitted this recipe |
| submitted | Date recipe was submitted |
| tags | Food.com tags for recipe |
| nutrition | Nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for "percentage of daily value" |
| n_steps | Number of steps in recipe |
| steps | Text for recipe steps, in order |
| description | User-provided description |
| ingredients | Text for recipe ingredients |
| n_ingredients | Number of ingredients in recipe |

| Column | Description |
|---|---|
| user_id | User ID |
| recipe_id | Recipe ID |
| date | Date of interaction |
| rating | Rating given |
| review | Review text |

Using these datasets I explored whether the recipes with high calories were rated lower than recipes with low calories. The answer to this question could provide useful insight into how calories affect an individual's perception of a recipe.
To make the datasets more suitable for my analysis I made the following transformations to the
recipes and ratings datasets.

- Merged the recipes and ratings datasets together.
- Replaced … with np.nan in the merged dataset.
- Added … to the recipes dataset.
- Converted the nutrition, tags, steps, and ingredients columns from str to list.
- Split the nutrition column into individual columns.
- Added a calorie_level column that labels recipes as either "High" or "Low" depending on whether their calories are greater than or less than the median for calories.

The cleaned dataset features 83,782 rows and 20 columns. Below are the selected columns and their respective data types.
| Column | Type |
|---|---|
| name | object |
| id | int64 |
| minutes | int64 |
| submitted | object |
| tags | object |
| n_steps | int64 |
| steps | object |
| description | object |
| ingredients | object |
| n_ingredients | int64 |
| avg_rating | float64 |
| calories | float64 |
| total_fat | float64 |
| sugar | float64 |
| sodium | float64 |
| protein | float64 |
| saturated_fat | float64 |
| carbohydrates | float64 |
| calorie_level | object |

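The str-to-list conversion, the nutrition split, and the calorie_level label from the cleaning steps can be sketched as follows. This is a minimal illustration on a hypothetical two-row frame, not the project's actual code. The nutrient column names follow the table above, and treating recipes exactly at the median as "Low" is an assumption.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical toy rows; the real data comes from the food.com recipes file.
recipes = pd.DataFrame({
    "id": [1, 2],
    "nutrition": [
        "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
        "[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",
    ],
})

NUTRITION_COLS = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]

# Parse the string form of each list into a real Python list.
parsed = recipes["nutrition"].apply(ast.literal_eval)

# Expand the lists into one float column per nutrient.
nutrition = pd.DataFrame(
    parsed.tolist(), columns=NUTRITION_COLS, index=recipes.index
).astype(float)
recipes = pd.concat([recipes.drop(columns="nutrition"), nutrition], axis=1)

# Label recipes relative to the median (assumption: ties count as "Low").
median_calories = recipes["calories"].median()
recipes["calorie_level"] = np.where(
    recipes["calories"] > median_calories, "High", "Low"
)
```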
For this plot I analyzed the distribution of calories. There were many large
outliers in the calories column. To keep them from distorting the plot, I dropped the outliers
in calories using the interquartile range (IQR) method. The distribution is skewed to
the right, indicating that there are more recipes with lower calories than with higher calories.
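The IQR method mentioned above can be sketched like this. The helper name drop_iqr_outliers and the conventional 1.5 multiplier are assumptions, since the text does not state which fence width was used.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Hypothetical calorie values with one extreme recipe.
toy = pd.DataFrame({"calories": [100.0, 150.0, 200.0, 250.0, 300.0, 5000.0]})
trimmed = drop_iqr_outliers(toy, "calories")
```

In this toy frame the fences work out to -25 and 475, so only the 5000-calorie row is removed.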
For this plot I analyzed the distribution of avg_rating. The distribution is skewed to
the left, indicating that most of the recipes have a high average rating. It appears that very few
recipes have an average rating below 4.
For my bivariate analysis, I created two columns, is_dietary and is_healthy, to
help with the analysis. is_dietary indicates whether a recipe contains the dietary tag and
is_healthy indicates whether a recipe contains the healthy tag.
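A minimal sketch of the two indicator columns, assuming the tags column already holds Python lists (as in the cleaning step) and that the flags are stored as 1 and 0:

```python
import pandas as pd

# Hypothetical rows for illustration.
recipes = pd.DataFrame({
    "name": ["a", "b", "c"],
    "tags": [["dietary", "healthy"], ["desserts"], ["healthy", "easy"]],
})

# 1 when the tag is present in the recipe's tag list, 0 otherwise.
recipes["is_dietary"] = recipes["tags"].apply(lambda tags: int("dietary" in tags))
recipes["is_healthy"] = recipes["tags"].apply(lambda tags: int("healthy" in tags))
```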
For this plot I analyzed the distribution of avg_rating between dietary and non-dietary
recipes. The distribution is still heavily skewed to the left. Non-dietary recipes appear to have
slightly higher average ratings than dietary recipes.
For this plot I analyzed the distribution of calories by avg_rating between
healthy and non-healthy recipes. Healthy recipes seem to have fewer calories than non-healthy
recipes.
For this section, I investigated the relationship between the number of ingredients and
calories of the recipes. First, I dropped outliers in the calories column using the IQR
method. Then, I grouped the cleaned recipes dataframe by n_ingredients. I then
selected the calories column and applied the aggregate functions, mean, median, and count.
| # of ingredients | mean | median | count |
|---|---|---|---|
| 1 | 205.455 | 175.1 | 11 |
| 2 | 209.569 | 138.3 | 698 |
| 3 | 218.418 | 158.1 | 2233 |
| 4 | 240.777 | 185.3 | 4261 |
| 5 | 258.783 | 211.65 | 6236 |
| 6 | 280.463 | 233.5 | 7173 |
| 7 | 302.227 | 258.55 | 8064 |
| 8 | 321.79 | 278.1 | 8530 |
| 9 | 333.14 | 294.2 | 8183 |
| 10 | 343.212 | 304.95 | 7566 |
| 11 | 360.412 | 323.8 | 6544 |
| 12 | 372.655 | 337.3 | 5351 |
| 13 | 389.226 | 360.2 | 4195 |
| 14 | 403.049 | 374.65 | 2998 |
| 15 | 418.244 | 385.4 | 2183 |
| 16 | 436.868 | 412.9 | 1530 |
| 17 | 453.48 | 422.9 | 1048 |
| 18 | 461.114 | 433.55 | 678 |
| 19 | 473.961 | 439.3 | 456 |
| 20 | 480.436 | 449.6 | 324 |
| 21 | 484.994 | 441.9 | 193 |
| 22 | 504.149 | 489.1 | 123 |
| 23 | 508.351 | 509.85 | 84 |
| ... | ... | ... | ... |
| 30 | 588.2 | 594.3 | 11 |
| 31 | 402.94 | 479.9 | 5 |
| 32 | 363.1 | 363.1 | 1 |
| 33 | 338.2 | 338.2 | 1 |
Below is a visual representation of this table with two y-axes to view the data more clearly. The most common number of ingredients is 8, which suggests that recipes most often have moderate complexity in terms of ingredient count. Both the mean and median calorie counts tend to increase steadily as the number of ingredients increases, which suggests that more complex recipes generally contain more calories. The mean and median follow the same trend, although the mean sits above the median for ingredient counts 1 through 22, which suggests that the calorie distribution within those groups is still somewhat right-skewed.
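The grouping and aggregation behind the table can be sketched as follows, using a hypothetical five-row frame in place of the cleaned recipes data:

```python
import pandas as pd

# Hypothetical rows; the real frame is the outlier-trimmed recipes dataset.
recipes = pd.DataFrame({
    "n_ingredients": [2, 2, 3, 3, 3],
    "calories": [100.0, 200.0, 150.0, 250.0, 500.0],
})

# One row per ingredient count, with mean, median, and count of calories.
summary = (
    recipes.groupby("n_ingredients")["calories"]
    .agg(["mean", "median", "count"])
)
```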
I believe the review column from the ratings dataset is not missing at random
(NMAR). People who viewed a recipe as mediocre or forgettable would not be inclined to leave a review
because they would not have much to praise or criticize about the recipe. If the recipe was just in the
middle of the road for them they would not feel strongly enough to take the time and effort to write and
publish a review. In contrast, an individual who really enjoyed or despised the recipe would be much
more inclined to put in the time and effort to leave a review for the recipe.
I checked the cleaned recipes dataset to see which column had the most missing
values and found that it was avg_rating. I started by creating an
avg_rating_missing column that indicated whether the average rating was missing for the recipe.
Calories:
The first test I conducted was to determine whether the missingness of avg_rating was
dependent on calories. I began by dropping the outliers in the calories column
using the IQR method.
Null Hypothesis:
The missingness of average rating does not depend on the calories in a recipe.
Alternate Hypothesis:
The missingness of average rating does depend on the calories in a recipe.
Test Statistic:
Absolute Mean Difference
Significance Level:
0.05
I performed a permutation test by shuffling the avg_rating_missing column
1000 times and calculated the simulated absolute mean differences of calories between the missing
and not missing groups.
This plot displays the distribution of simulated test statistics compared to the observed statistic. The observed statistic was 87.86 and is displayed as the red line. None of the simulated test statistics were greater than or equal to the observed statistic, thus the p-value was 0.0. Since the p-value was less than our significance level of 0.05, we reject the null hypothesis. Thus, there is evidence to support the alternative hypothesis. The missingness of average rating does depend on the calories in a recipe.
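The permutation procedure described above can be sketched as follows. This is an illustration under assumptions rather than the project's code: the function name, the seed, and the toy arrays are hypothetical.

```python
import numpy as np


def permutation_test_abs_mean_diff(values, is_missing, n_permutations=1000, seed=0):
    """Permutation test using the absolute difference of group means."""
    values = np.asarray(values, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)
    rng = np.random.default_rng(seed)

    def abs_mean_diff(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = abs_mean_diff(is_missing)
    # Shuffle the missingness labels and recompute the statistic each time.
    simulated = np.array([
        abs_mean_diff(rng.permutation(is_missing))
        for _ in range(n_permutations)
    ])
    # Proportion of shuffles at least as extreme as the observed statistic.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Hypothetical data: recipes with a missing rating have far more calories.
calories = np.concatenate([np.full(50, 500.0), np.full(50, 100.0)])
rating_missing = np.array([True] * 50 + [False] * 50)
observed, p_value = permutation_test_abs_mean_diff(calories, rating_missing)
```

With 1000 shuffles, a reported p-value of 0.0 means that none of the simulated statistics reached the observed one, so the estimate is best read as "below 1 in 1000" rather than exactly zero.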
Minutes:
The second test I conducted was to determine whether the missingness of avg_rating was
dependent on minutes. I began by dropping the outliers in the minutes column
using the IQR method.
Null Hypothesis:
The missingness of average rating does not depend on the sodium (PDV) in a recipe.
Alternate Hypothesis:
The missingness of average rating does depend on the sodium (PDV) in a recipe.
Test Statistic:
Absolute Mean Difference
Significance Level:
0.05
I performed a permutation test again by shuffling the avg_rating_missing
column 1000 times and calculated the simulated absolute mean differences of sodium (PDV) between the
missing and not missing groups.
The observed statistic was 0.35 and is displayed as the red line. This time, a majority of the simulated test statistics were greater than or equal to the observed statistic, so the p-value was 0.89. Since the p-value was greater than our significance level of 0.05, we fail to reject the null hypothesis. We do not have evidence that the missingness of average rating depends on the sodium (PDV) in a recipe.
For my hypothesis test, I explored the relationship between a recipe's average rating and the calorie level. As mentioned in data cleaning, recipes with a calorie count greater than the median were classified as "high calorie" and recipes with a calorie count below were classified as "low calorie." To analyze this relationship, I conducted a permutation test.
Null Hypothesis:
The average rating of low-calorie recipes is equal to or less than the average rating of
high-calorie recipes.
Alternative Hypothesis:
The average rating of low-calorie recipes is higher than that of high-calorie recipes.
Test Statistic:
Difference of Means (Low - High)
Significance Level:
0.05
I decided to use a permutation test because it does not rely on distributional assumptions and is well suited for comparing two groups in this scenario. I proposed that the average rating for low-calorie recipes is higher than that of high-calorie recipes because people may be worried about the health risks of high-calorie recipes. I chose the difference of means as the test statistic because I wanted to find out whether low-calorie recipes were rated higher on average than high-calorie recipes. The difference of means, as opposed to the absolute difference of means, shows which group has the higher mean average rating.
I shuffled the avg_rating column 1000 times and calculated the mean differences between the
low and high calorie groups.
The observed statistic was 0.008 and is displayed as the red line. The p-value calculated from this test was 0.039, which is less than our significance level of 0.05. Thus, we reject the null hypothesis in favor of the alternative hypothesis. The average rating of low-calorie recipes tends to be higher than that of high-calorie recipes.
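Because the alternative hypothesis is directional, the p-value counts only the shuffles whose signed difference (Low - High) is at least as large as the observed one. A minimal sketch, with hypothetical function names and toy data:

```python
import numpy as np
import pandas as pd


def diff_of_means(df, rating_col="avg_rating", group_col="calorie_level"):
    """Mean rating of the "Low" group minus mean rating of the "High" group."""
    means = df.groupby(group_col)[rating_col].mean()
    return means["Low"] - means["High"]


def one_sided_permutation_test(df, n_permutations=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = diff_of_means(df)
    shuffled = df.copy()
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle ratings across recipes while group labels stay fixed.
        shuffled["avg_rating"] = rng.permutation(df["avg_rating"].to_numpy())
        simulated[i] = diff_of_means(shuffled)
    # One-sided: only large positive differences count as evidence.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Hypothetical data in which low-calorie recipes are rated higher.
toy = pd.DataFrame({
    "calorie_level": ["Low"] * 20 + ["High"] * 20,
    "avg_rating": [5.0] * 20 + [4.0] * 20,
})
observed, p_value = one_sided_permutation_test(toy)

# Flipping the labels reverses the sign, so the one-sided p-value becomes large.
flipped = toy.assign(calorie_level=["High"] * 20 + ["Low"] * 20)
observed_flipped, p_value_flipped = one_sided_permutation_test(flipped)
```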
I aim to predict a recipe's average score. I plan to use a classification model that
classifies recipes into three categories, high, medium, and low. High would be
scores greater than 3.75, medium would be scores between 3.75 and 2.75, and low would be any score less
than 2.75.
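The three classes can be derived from avg_rating roughly as follows. The text does not say which class the boundary values 2.75 and 3.75 belong to, so this sketch assumes both fall into "medium". Keeping missing ratings as missing is also an assumption.

```python
import numpy as np
import pandas as pd


def rating_class(avg_rating: pd.Series) -> pd.Series:
    """Map average ratings to "high", "medium", or "low"."""
    labels = np.select(
        [avg_rating > 3.75, avg_rating < 2.75],
        ["high", "low"],
        default="medium",  # assumption: 2.75 and 3.75 themselves are "medium"
    )
    # Comparisons with NaN are False, so mask missing ratings explicitly.
    return pd.Series(labels, index=avg_rating.index).where(avg_rating.notna())


ratings = pd.Series([5.0, 3.75, 3.0, 2.75, 1.0, np.nan])
classes = rating_class(ratings)
```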
I opted to use a classification model as opposed to a regression model because there is not much variance
in the avg_rating column. We also found earlier that the average rating distribution is
heavily skewed to the left, with most recipes receiving high ratings (primarily 5). To address this
imbalance and to focus on predicting meaningful distinctions, I decided to classify the average ratings
into high, medium, and low categories. This approach allows the model to better capture these distinct
groups instead of trying to predict a nearly uniform continuous target.
To evaluate this model, I will be using F1 score and accuracy because
avg_rating is
heavily imbalanced, with the majority of ratings being "high." The F1 score takes into account both
precision and recall across all classes and gives more weight to the performance on the larger classes.
Accuracy provides an overall snapshot of how many predictions were correct. Together, these metrics give
a comprehensive picture of how well the model performs across all classes.
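To see how these metrics behave on imbalanced classes, consider a hypothetical classifier that always predicts "high". The sketch below assumes the reported F1 score is the class-weighted average, which matches the description above but is not confirmed by the text. The macro average and per-class scores are shown for comparison.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: nine "high" recipes, one "low", all predicted "high".
y_true = ["high"] * 9 + ["low"]
y_pred = ["high"] * 10

accuracy = accuracy_score(y_true, y_pred)

# Weighted F1 averages per-class F1 scores in proportion to class size.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Macro F1 treats every class equally, so the missed class drags it down.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Per-class F1 in the order given by `labels`.
per_class_f1 = f1_score(
    y_true, y_pred, average=None, labels=["high", "low"], zero_division=0
)
```

Here accuracy is 0.9 and the weighted F1 is about 0.85 even though the "low" class is never predicted, which is why per-class scores are worth checking alongside the headline numbers.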
For my baseline model I implemented a random forest classifier. I used the
features calories and is_dietary to classify avg_rating as
high, medium, or low. First, I created an avg_rating_class column that indicated whether a recipe's
average rating was high, medium, or low. Then, I split the data into training and testing sets to
train and test my model.
calories
The number of calories in the recipe.
is_dietary
1 if the recipe contains the dietary tag and 0 otherwise.
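A baseline along these lines could look like the sketch below. The data is synthetic, and the 25% test split, the stratification, and the random seeds are assumptions rather than the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned recipes data, with imbalanced classes.
rng = np.random.default_rng(0)
labels = np.array(["high"] * 270 + ["medium"] * 20 + ["low"] * 10)
data = pd.DataFrame({
    "calories": rng.uniform(50, 800, size=len(labels)),
    "is_dietary": rng.integers(0, 2, size=len(labels)),
    "avg_rating_class": rng.permutation(labels),
})

X = data[["calories", "is_dietary"]]
y = data["avg_rating_class"]

# Stratify so every class appears in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
weighted_f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)
```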
The F1 score of this model was 0.90 and the accuracy was 0.92. These scores are high and at first glance suggest the model is performing as intended. However, this may be misleading: the F1 score for "high" is 0.96, but for "medium" and "low" it is only 0.01 each. This is likely because the majority of the average ratings fall above 4.
For my final model I engineered new features that could help. These features were
avg_step_desc_length, description_length, and year_submitted.
calories
The number of calories in the recipe. I used RobustScaler to transform
calories because many outliers are present in the calories column. Thus,
the values were scaled while also handling outliers.
is_dietary
1 if the recipe contains the dietary tag and 0 otherwise. From the EDA, I noticed that dietary recipes had slightly lower ratings than non-dietary recipes. This may prove to be useful for classifying average rating.
avg_step_desc_length
Using the steps column I took the length of each individual step and
calculated the average step length for each recipe. Recipes with a higher average step length have more
in-depth steps to follow. After plotting an overlaid histogram I found that recipes with longer
average step descriptions tend to have slightly higher average scores than recipes with shorter
descriptions. I applied StandardScaler to avg_step_desc_length in order to
make sure that average step description lengths were comparable in range because some recipes
featured very long steps.
description_length
Using the description column I extracted the length of each description.
Detailed descriptions may sway the way people rate recipes. After plotting an overlaid histogram, I
found that recipes with longer descriptions tend to have slightly lower average scores than recipes
with shorter descriptions. I applied StandardScaler to description_length
in order to make sure that description lengths were comparable in range because some recipes
featured very long descriptions.
year_submitted
Using the submitted column I extracted the year each recipe was submitted. I
grouped the dataset by year_submitted, selected the avg_rating column,
then took the mean of the avg_rating values. Using this I was able to create a line
plot that showed that recipes submitted more recently had a lower mean average rating. This could be
helpful for classifying average rating.
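The three engineered features can be sketched as follows. Measuring length in characters (rather than words) and treating a missing description as length 0 are assumptions, and the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; steps are assumed to already be lists of strings.
recipes = pd.DataFrame({
    "steps": [["preheat oven", "bake"], ["mix"]],
    "description": ["a quick weeknight dinner", None],
    "submitted": ["2008-10-27", "2011-04-11"],
})

# Average number of characters per step (0.0 for a recipe with no steps).
recipes["avg_step_desc_length"] = recipes["steps"].apply(
    lambda steps: float(np.mean([len(step) for step in steps])) if steps else 0.0
)

# Number of characters in the description (assumption: missing counts as 0).
recipes["description_length"] = recipes["description"].fillna("").str.len()

# Year component of the submission date.
recipes["year_submitted"] = pd.to_datetime(recipes["submitted"]).dt.year
```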
I also used GridSearchCV to tune the hyperparameters of my random forest classifier.
I tuned n_estimators, max_depth, and
min_samples_split. The best parameters found were an n_estimators of 50,
a max_depth of 5, and a min_samples_split of 2.
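One way to combine the scalers, the classifier, and the grid search is a scikit-learn Pipeline, sketched below on synthetic data. The candidate grid values, the f1_weighted scoring choice, the 3-fold cross-validation, and the step names are all assumptions. Only the three tuned parameter names and the scaler assignments come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic stand-in for the engineered feature matrix.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "calories": rng.uniform(50, 800, n),
    "is_dietary": rng.integers(0, 2, n),
    "avg_step_desc_length": rng.uniform(10, 120, n),
    "description_length": rng.integers(0, 500, n),
    "year_submitted": rng.integers(2008, 2019, n),
})
y = rng.permutation(np.array(["high"] * 170 + ["medium"] * 20 + ["low"] * 10))

# RobustScaler for calories, StandardScaler for the two length features;
# the remaining columns pass through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("robust", RobustScaler(), ["calories"]),
        ("standard", StandardScaler(),
         ["avg_step_desc_length", "description_length"]),
    ],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Hypothetical candidate values for the three tuned hyperparameters.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [5, 10],
    "forest__min_samples_split": [2, 5],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=3)
search.fit(X, y)
best_params = search.best_params_
```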
My final model ended up with an F1 score of 0.91 and an accuracy of 0.94. This is a small improvement over my baseline model, but unfortunately it did not get much better at classifying the "medium" and "low" rating classes. Looking back, it may have been easier to predict a variable with more variance. Roughly 58.9% of the average ratings were 5.0, which most likely made predicting classes other than "high" difficult.
For my fairness analysis, I split the dataset into low and high calorie recipes. As in my hypothesis test, low-calorie recipes are those below the median for calories and high-calorie recipes are those above it.
Null Hypothesis:
Our model is fair. Its precision for low calorie and high calorie recipes is roughly the same, and
any differences are due to random chance.
Alternative Hypothesis:
Our model is unfair. Its precision for low calorie recipes is higher than its precision for high
calorie recipes.
Test Statistic:
Difference in Precision (High - Low)
Significance Level:
0.05
I ran a permutation test to evaluate these hypotheses. I shuffled the calorie_level column 1000
times and calculated the difference in precision between the high and low calorie groups for each
shuffle. I got an observed test statistic of 0.00, which is displayed as the red line. The p-value for
this test was 0.469, which is greater than the significance level of 0.05. Thus, we fail to reject the
null hypothesis. Our model's precision for low and high calorie recipes appears to be roughly the same.
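The fairness test can be sketched as follows. The text does not say how precision was computed for the three-class model, so this sketch assumes precision for the "high" class only. The function names and toy data are hypothetical. Because the statistic is High - Low and the alternative says low-calorie precision is higher, small (negative) values count as evidence against the null.

```python
import numpy as np


def precision_for(y_true, y_pred, positive="high"):
    """Share of `positive` predictions that are correct (NaN if there are none)."""
    predicted_positive = y_pred == positive
    if predicted_positive.sum() == 0:
        return np.nan
    return float(np.mean(y_true[predicted_positive] == positive))


def fairness_permutation_test(y_true, y_pred, is_high_calorie,
                              n_permutations=1000, seed=0):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_high_calorie = np.asarray(is_high_calorie, dtype=bool)
    rng = np.random.default_rng(seed)

    def precision_diff(groups):
        # Precision on high-calorie recipes minus precision on low-calorie ones.
        return (precision_for(y_true[groups], y_pred[groups])
                - precision_for(y_true[~groups], y_pred[~groups]))

    observed = precision_diff(is_high_calorie)
    simulated = np.array([
        precision_diff(rng.permutation(is_high_calorie))
        for _ in range(n_permutations)
    ])
    # One-sided: shuffles at least as negative as the observed difference.
    p_value = float(np.mean(simulated <= observed))
    return observed, p_value


# Hypothetical predictions: precision is 0.5 on high-calorie recipes
# and 1.0 on low-calorie recipes.
y_true = np.array(["high"] * 10 + ["low"] * 10 + ["high"] * 20)
y_pred = np.array(["high"] * 40)
is_high_calorie = np.array([True] * 20 + [False] * 20)
observed, p_value = fairness_permutation_test(y_true, y_pred, is_high_calorie)
```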