By: Justin Lee
This data science project, conducted at UCSD, explores the relationship between calories and average recipe ratings. According to the Food and Drug Administration (FDA), calories are a measure of how much energy you get from one serving of food. Consuming more calories than you use on a daily basis is linked to overweight and obesity. The dataset used for this project contains recipes and ratings from food.com posted since 2008. This dataset was originally scraped and used in the paper, Generating Personalized Recipes from Historical User Preferences, by Majumder et al.
The recipe
dataset contains 83,782 unique recipes. The following information is given for each recipe:
Column | Description |
---|---|
name |
Recipe name |
id |
Recipe ID |
minutes |
Minutes to prepare recipe |
contributor_id |
User ID who submitted this recipe |
submitted |
Date recipe was submitted |
tags |
Food.com tags for recipe |
nutrition |
Nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for “percentage of daily value” |
n_steps |
Number of steps in recipe |
steps |
Text for recipe steps, in order |
description |
User-provided description |
ingredients |
Text for recipe ingredients |
n_ingredients |
Number of ingredients in recipe |
The ratings
dataset contains 731,927 unique reviews. The following information is given for each review:
Column | Description |
---|---|
user_id |
User ID |
recipe_id |
Recipe ID |
date |
Date of interaction |
rating |
Rating given |
review |
Review text |
Using these datasets I explored whether the recipes with high calories were rated lower than recipes with low calories. The answer to this question could provide useful insight into how calories affect an individual’s perception of a recipe.
To make the datasets more suitable for my analysis I made the following transformations to the recipes
and ratings
datasets.
Left merge the recipes
and ratings
datasets together.
Filled all ratings of 0 with np.nan
in the merged
dataset.
np.nan
avoids having biased average ratings.Calculated the average rating per recipe, as a Series.
Added this the average recipe rating Series back to the recipes
dataset.
Converted nutrition
, tags
, steps
, and ingredients
columns from str to list.
Split the nutrition
column into individual columns.
Added calorie_level
column that labels recipes as either “High” or “Low” depending if their calories are greater than or less than the median for calories
.
Drop columns unrelated to my topic.
The cleaned dataset features 83,782 rows and 20 columns. Below are the selected columns and their respective data types.
Column | Type |
---|---|
name |
object |
id |
int64 |
minutes |
int64 |
submitted |
object |
tags |
object |
n_steps |
int64 |
steps |
object |
description |
object |
ingredients |
object |
n_ingredients |
int64 |
avg_rating |
float64 |
calories |
float64 |
total_fat |
float64 |
sugar |
float64 |
sodium |
float64 |
protein |
float64 |
saturated_fat |
float64 |
carbohydrates |
float64 |
calorie_level |
object |
For this plot I analyzed the distribution of calories
. There were a lot of large outliers present in the calories
column. In order to avoid them, I dropped the outliers of calories
using the interquartile range (IQR) method. The distribution is skewed to the right, indicating that more recipes with lower calories than higher calories.
For this plot I analyzed the distribution of avg_rating
. The distribution is skewed to the left, indicating most of the recipes have a high average rating. It appears that very few recipes have a average rating below 4.
For my bivariate analysis, I created two columns is_dietary
and is_healthy
to help with the analysis. is_dietary
indicates if a recipe contains the dietary tag and is_healthy
indicates if a recipe conatains the healthy tag.
For this plot I analyzed the distribution of avg_rating
between dietary and non-dietary recipes. The distribution is still heavily skewed to the left. Non-dietary recipes appear to have slightly higher average ratings than dietary recipes.
For this plot I analyzed the distribution of calories
by avg_rating
between healthy and non-healthy recipes. Healthy recipes seem to have less calories than non-healthy recipes.
For this section, I investigated the relationship between the number of ingredients and calories of the recipes. First, I dropped outliers in the calories
column using the IQR method. Then, I grouped the cleaned recipes
dataframe by n_ingredients
. I then selected the calories
column and applied the aggregate functions, mean, median, and count.
# of ingredients | mean | median | count |
---|---|---|---|
1 | 205.455 | 175.1 | 11 |
2 | 209.569 | 138.3 | 698 |
3 | 218.418 | 158.1 | 2233 |
4 | 240.777 | 185.3 | 4261 |
5 | 258.783 | 211.65 | 6236 |
6 | 280.463 | 233.5 | 7173 |
7 | 302.227 | 258.55 | 8064 |
8 | 321.79 | 278.1 | 8530 |
9 | 333.14 | 294.2 | 8183 |
10 | 343.212 | 304.95 | 7566 |
11 | 360.412 | 323.8 | 6544 |
12 | 372.655 | 337.3 | 5351 |
13 | 389.226 | 360.2 | 4195 |
14 | 403.049 | 374.65 | 2998 |
15 | 418.244 | 385.4 | 2183 |
16 | 436.868 | 412.9 | 1530 |
17 | 453.48 | 422.9 | 1048 |
18 | 461.114 | 433.55 | 678 |
19 | 473.961 | 439.3 | 456 |
20 | 480.436 | 449.6 | 324 |
21 | 484.994 | 441.9 | 193 |
22 | 504.149 | 489.1 | 123 |
23 | 508.351 | 509.85 | 84 |
… | … | … | … |
30 | 588.2 | 594.3 | 11 |
31 | 402.94 | 479.9 | 5 |
32 | 363.1 | 363.1 | 1 |
33 | 338.2 | 338.2 | 1 |
Below is a visual representation of this table with two y-axis to view the data more clearly. It appears that the most popular amount of ingredients to have is 8. This shows that a majority of recipes have a moderate complexity in terms of ingredient count. Both the mean and median calorie counts tend to increase steadily as the number of ingredients increases. This suggests that more complex recipes generally contain higher calorie counts. The mean and median calories track closely together, indicating that the calorie distribution for each number of ingredients is roughly symmetric.
I believe the review
column from the ratings
dataset is not missing at random (NMAR). People who viewed a recipe as mediocre or forgettable would not be inclined to leave a review because they would not have much to praise or criticize about the recipe. If the recipe was just in the middle of the road for them they would not feel strongly enough to take the time and effort to write and publish a review. In contrast, an individual who really enjoyed or despised the recipe would be much more inclined to put in the time and effort to leave a review for the recipe.
I filtered the cleaned recipes
dataset to see which column had the most missing values and found that avg_rating
had the most missing values. I started by creating a avg_rating_missing
column that indicated if the average rating was missing for the recipe.
Calories:
The first test I conducted was to determine whether the missingness of avg_rating
was dependent on calories
. I began by droping the outlier in calories
column using the IQR method.
Null Hypothesis:
The missingness of average rating does not depend on the calories in a recipe.
Alternate Hypothesis:
The missingness of average rating does depend on the calories in a recipe.
Test Statistic:
Absolute Mean Difference
Significance Level:
0.05
I performed a permutation test by shuffling the avg_rating_missing
column 1000 times and calculated the simulated absolute mean differences of calories between the missing and not missing groups.
This plot displays the distribution of simulated test statistics compared to the observed statistic. The observed statistic was 87.86 and is displayed as the red line. None of the simulated test statistics were greater than or equal to the observed statistic, thus the p-value was 0.0. Since the p-value was less than our significance level of 0.05, we reject the null hypothesis. Thus, there is evidence to support the alternative hypothesis. The missingness of average rating does depend on the calories in a recipe.
Minutes:
The second test I conducted was to determine whether the missingness of avg_rating
was dependent on minutes
. I began by droping the outlier in minutes
column using the IQR method.
Null Hypothesis:
The missingness of average rating does not depend on the sodium (PDV) a recipe.
Alternate Hypothesis:
The missingness of average rating does depend on the sodium (PDV) a recipe.
Test Statistic:
Absolute Mean Difference
Significance Level:
0.05
I performed a permutation test again by shuffling the avg_rating_missing
column 1000 times and calculated the simulated absolute mean differences of sodium (PDV) between the missing and not missing groups.
The observed statistic was 0.35 and is displayed as the red line. This time, a majority of the simulated test statistics were greater than or equal to the observed statistic, thus the p-value was 0.89. Since the p-value was greater than our significance level of 0.05, we fail to reject the null hypothesis. The missingness of average rating does not depend on the sodium (PDV) a recipe.
For my hypothesis test, I explored the relationship between a recipe’s average rating and the calorie level. As mentioned in data cleaning, recipes with a calorie count greater than the median were classified as “high calorie” and recipes with a calorie count below were classified as “low calorie.” To analyze this relationship, I conducted a permutation test.
Null Hypothesis:
The average rating of low-calorie recipes is equal to or less than the average rating of high-calorie recipes.
Alternative Hypothesis:
The average rating of low-calorie recipes is higher than that of high-calorie recipes.
Test Statistic:
Difference of Means (Low - High)
Significance Level:
0.05
I decided to use a permutation test because it does not rely on distributional assumptions and is well suited for comparing groups in this scenario. I proposed that the average rating for low-calorie recipes is higher than higher-calorie recipes because people may be worried about the negative health risks present in the recipe. I chose the difference of means as the test statisitic because I wanted to find the whether low-calorie recipes were rated higher on average than high-calorie recipes. The difference of means, opposed to the absolute difference of means, shows the which group has a higher mean average rating.
I shuffled the avg_rating
column 1000 times and calculated the mean differences between the low and high calorie groups.
The observed statistic was 0.008 and is displayed as the red line. The p-value calculated from this test was 0.039 which is less than our 0.05. Thus, we reject the null hypothesis in favor of the alternative hypothesis. The average rating of low-calorie recipes tend to be higher than that of high-calorie recipes.
I aim to predict a recipe's average score. I plan to use a classification model that classifies recipes into three categories, high, medium, and low. High would be scores greater than 3.75, medium would be scores between 3.75 and 2.75, and low would be any score less than 2.75.
I opted to use a classification model opposed to a regression model because there is not much variance amongst the avg_rating
column. We also found out earlier that the average rating distirbution is heavily skewed to the left, with most recipes receiving high ratings (primarily 5). To address this imbalance and to focus on predicting meaningful distinctions, I decided to classify the average ratings into high, medium, and low categories. This approach allows the model to better capture these distinct groups instead of trying to predict a nearly uniform continuous target.
To evaluate this model, I will be using F1 score and accuracy because avg_rating
is heavily imbalanced, with the majority of ratings being “high.” The F1 score takes into account both precision and recall across all classes and gives more weight to the performance on the larger classes. Accuracy provides an overall snapshot of how many predictions were correct. Together, these metrics give a comprehensive picture of how well the model performs across all classes.
For my baseline model I am implementing a random forest classifier. I will be using the features, calories
and is_dietary
to help classify avg_rating
as high, medium, or low. First, I created a avg_rating_class
column that indicated if a recipe was high, medium, ore low calorie. Then, I split the data points into training and testing sets to be used to train and test my model.
calories
The number of calories in the recipe.
is_dietary
1 if the recipe contains the dietary tag and 0 otherwise.
The F1 score of this model was 0.90 and the accuracy of this model was 0.92. These scores are really high and suggest the model is performing as intended. Although this may just be a facade, the F1 score for “high” is 0.96, but for “medium” and “low” the F1 score is a measly 0.01 for both. This is likely due to the fact that a majority of the average ratings falls above 4.
For my final model I engineered new features that could help. These features were avg_step_desc_length
, description_length
, and year_submitted
.
calories
The number of calories in the recipe. I used RobustScaler
to transform calories
because many outliers are present in the calories
column. Thus, the values were scaled while also handling outliers.
is_dietary
1 if the recipe contains the dietary tag and 0 otherwise. From the EDA, I noticed that dietary recipes had slightly lower ratings than non-dietary ratings. This may prove to be usedful for classifying average rating.
avg_step_desc_length
Using the steps
column I took the length of each individual step and calculated the average length of each step. Recipes with higher average step length have more in depth steps to follow. After plotting and overlaid histogram I found that recipes with longer average step descriptions tend to have slightly higher average scores than recipes with shorter descriptions. I applied StandardScaler
to avg_step_desc_length
inorder to make sure that average step description lengths were compareable in range because some recipes featured very long steps.
description_length
Using the description
column I extracted the length of each description. Detailed descriptions may sway the way people rate recipes. After plotting an overlaid hisogram, I found that recipes with longer descriptions tend to have slightly lower average scores than recipes with shorter descriptions. I applied StandardScaler
to description_length
inorder to make sure that description lengths were compareable in range because some recipes featured very description.
year_submitted
Using the submitted
column I extracted the year each recipe was submitted. I grouped the dataset by year_submitted
, selected the avg_rating
column, then took the mean of the avg_rating
values. Using this I was able to create a line plot that showed that recipes submitted more recently had lower mean average score. This could be helpful for classifying average rating.
I also implemented GridSearchCV
to hyperparameterize my random forest classifier. I aimed to tune the parameters, n_estimators
, max_depth
, and min_samples_split
. It was found that the best parameters were a 50 n_estimators
, a max depth
of 5, and a min_samples_split
of 2.
My final model ended up with an F1 score of 0.91 and an accuracy of 0.94. An small improvement from my base model, but unfortunately it did not improve much classifying “medium” and “low” calorie recipes. Looking back it may have been easier to predict a variable with more variance. Roughly, 58.9% of the average ratings were 5.0 which most likely made prediciting classes outside of “high” difficult.
For my fairness analysis, I split the dataset into low and high calorie recipes. Similar to my hypothesis test, low-calorire recipes are recipes that are below the median and high-calorie recipes are recipes that are above.
Null Hypothesis:
Our model is fair. Its precision for low calorie and high calorie recipes are roughly the same, and any differences are due to random chance.
Alternative Hypothesis:
Our model is unfair. Its precision for low calorie recipes is higher than its precision for high calorie recipes.
Test Statistic:
Difference in Precission (High - Low)
Significance Level:
0.05
I ran a permutation test to find an outcome. I shuffled the calorie_level
column 1000 times and calculated the mean differences between the high and low calorie groups. I got an observed test statistic of 0.00 that is displayed as the red line. The p-value for this test was 0.469 which is greater than the significance level of 0.05. Thus, we fail to reject the null hypothesis. Our model’s precision for low and high calories are roughly the same.