Predicting Recipe Rating based on Recipe Length
Author: Jahnavi Naik
Introduction
Food is an important aspect of everyone’s life, especially mine. Food is not just a necessity; cooking and baking are a hobby and a profession for many. Naturally, food.com has become a prominent website for finding recipes for a variety of dishes, and it even allows users to leave reviews and ratings, helping others decide which recipes to try. When looking for a recipe to make, many home cooks value the time a recipe actually takes, so that it fits into their busy schedules. Because of this, I think it is important to understand the relationship between the length of a recipe and its rating, so I decided to focus on the question: how does the length of a recipe (in time and ingredients) affect its rating? For this project, I obtained 2 datasets, originally taken from food.com. The first dataset, called recipes, contains 12 columns and 83,782 rows. The second dataset, called reviews, contains 5 columns and 731,927 rows.
The columns in the recipe dataframe that are relevant to my focus are:
| Column | Description |
|---|---|
| id | The recipe ID, which is unique per recipe |
| minutes | The number of minutes it takes to complete a recipe |
| nutrition | A string (that looks like a list) of various nutrition facts including calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV) |
| n_steps | The number of steps in the recipe |
| ingredients | A string (that looks like a list) of ingredients used in the recipe |
| n_ingredients | The number of ingredients used in the recipe |
The columns of the reviews dataframe that are relevant to my focus are:
| Column | Description |
|---|---|
| recipe_id | The recipe ID, matching the id column in the recipes dataframe |
| rating | The rating given by the reviewer, on a 1-5 scale |
Data Cleaning and Exploratory Analysis
I started by merging the two dataframes on the recipe ID, in order to get a dataframe that contains all of the reviews of all the recipes. After merging, I found that some of the ratings were missing, and I decided to fill them with 0 in order to avoid bias.
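A minimal sketch of this step, assuming the raw files are named `RAW_recipes.csv` and `RAW_interactions.csv` (the actual file names may differ):

```python
import pandas as pd

# Load the two raw datasets scraped from food.com.
recipes = pd.read_csv("RAW_recipes.csv")
reviews = pd.read_csv("RAW_interactions.csv")

# Left-merge so every recipe is kept, matching recipe IDs across frames;
# the result has one row per review of each recipe.
merged = recipes.merge(reviews, left_on="id", right_on="recipe_id", how="left")

# Fill missing ratings with 0, as described above.
merged["rating"] = merged["rating"].fillna(0)
```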
Next, I decided to create a new column in the dataset for the average rating. The merged dataframe has one row per review, so a recipe can appear many times; averaging the ratings within each recipe gives a single overall assessment per recipe, which will be needed for future analysis.
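With a groupby-transform, the per-recipe mean can be broadcast back onto every review row (continuing from the `merged` frame in the sketch above):

```python
# Average rating per recipe, repeated on each of that recipe's review rows.
merged["avg_rating"] = merged.groupby("id")["rating"].transform("mean")
```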
I then went on to clean the individual columns of the dataframe, starting with the nutrition column. This column contains many different nutrition facts, and I decided that splitting it into one column per fact would be more useful for my analysis. Although this column looks like a list, each value is actually a string, so I cleaned it by removing the brackets and splitting on commas to turn it into a list. I was then able to expand this column of lists into multiple columns, each representing one nutrition fact.
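A sketch of that split, assuming the seven facts appear in the order listed in the column description above:

```python
# nutrition looks like "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]" but is a string.
nutrition_cols = [
    "calories (#)", "total fat (PDV)", "sugar (PDV)", "sodium (PDV)",
    "protein (PDV)", "saturated fat (PDV)", "carbohydrates (PDV)",
]

# Strip the brackets, split on commas, and cast each piece to float.
split_vals = (
    merged["nutrition"]
    .str.strip("[]")
    .str.split(",", expand=True)
    .astype(float)
)
split_vals.columns = nutrition_cols

# Replace the original string column with the seven numeric columns.
merged = pd.concat([merged.drop(columns="nutrition"), split_vals], axis=1)
```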
Lastly, I created a new column called has_sugar. To do this, I applied a function to the ingredients column that checks whether the word ‘sugar’ appears in the string, returning boolean values in this column. Because sugar is something many people pay attention to when it comes to food, I thought this column would be useful in my later prediction model.
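This amounts to a simple substring check over the ingredients string:

```python
# True if the word "sugar" appears anywhere in the ingredients string.
merged["has_sugar"] = merged["ingredients"].apply(lambda s: "sugar" in s)
```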
The final, cleaned, dataframe looks something like this:
| name | id | minutes | n_steps | n_ingredients | calories (#) | has_sugar | avg_rating |
|---|---|---|---|---|---|---|---|
| 1 brownies in the world best ever | 333281 | 40 | 10 | 9 | 138.4 | True | 4.0 |
| 1 in canada chocolate chip cookies | 453467 | 45 | 12 | 11 | 595.1 | True | 5.0 |
| 412 broccoli casserole | 306168 | 40 | 6 | 9 | 194.8 | False | 5.0 |
| 412 broccoli casserole | 306168 | 40 | 6 | 9 | 194.8 | False | 5.0 |
| 412 broccoli casserole | 306168 | 40 | 6 | 9 | 194.8 | False | 5.0 |
Univariate Analysis
To start my exploration of the data, I created a histogram to visualize the distribution of the number of steps per recipe, which ranges from 1 to 100 steps. The graph shows that the data is skewed to the right, and that most recipes have between 5 and 9 steps.
I also decided to create a histogram for the distribution of the minutes the recipes in the dataset take. The distribution is skewed to the right with a very long tail, making the initial graph unreadable. To combat this, I removed the outliers, identified with the IQR rule used throughout this analysis; one way to do this trimming is sketched below. The average recipe in this dataset takes 36 minutes.
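A sketch of the IQR trimming and the histogram (the exact plotting library and bin count are my assumptions):

```python
import matplotlib.pyplot as plt

# IQR rule: drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = merged["minutes"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = merged["minutes"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = merged[in_range]

# Histogram of recipe durations after removing outliers.
trimmed["minutes"].plot(kind="hist", bins=50, title="Minutes per recipe")
plt.show()
```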
Bivariate Analysis
I then decided to look at the relationship between pairs of variables, specifically the relationship between the average rating and the number of steps in the recipe. The scatter plot below suggests a somewhat positive relationship between these variables: recipes with a higher average rating tend to have more steps.
Aggregates
The pivot table below shows the mean, median, and standard deviation of the calories, grouped by the number of ingredients in the recipe. Outliers were removed using the IQR rule. The pivot table shows that as the number of ingredients increases, the number of calories of the recipe also tends to increase, indicating a positive relationship between the two variables; a sketch of how this table can be built appears after it.
| n_ingredients | mean calories (#) | median calories (#) | std calories (#) |
|---|---|---|---|
| 1 | 157.229630 | 144.20 | 113.179463 |
| 2 | 212.604051 | 144.70 | 205.470241 |
| 3 | 215.911121 | 163.60 | 191.614204 |
| 4 | 237.279823 | 183.60 | 193.219294 |
| 5 | 262.716944 | 219.00 | 191.958962 |
| 6 | 285.390661 | 239.90 | 207.755204 |
| 7 | 302.350313 | 252.20 | 205.388476 |
| 8 | 318.771192 | 275.20 | 204.539190 |
| 9 | 336.953357 | 294.90 | 207.894473 |
| 10 | 341.251078 | 304.10 | 201.831227 |
| 11 | 363.584400 | 324.80 | 207.621016 |
| 12 | 364.438348 | 326.40 | 202.957711 |
| 13 | 384.200156 | 359.50 | 205.592255 |
| 14 | 402.556815 | 374.30 | 211.273798 |
| 15 | 422.722760 | 392.50 | 215.983254 |
| 16 | 433.215455 | 416.70 | 210.721340 |
| 17 | 458.555769 | 419.70 | 216.732349 |
| 18 | 469.239942 | 440.20 | 215.431545 |
| 19 | 469.590974 | 439.10 | 224.048137 |
| 20 | 498.537207 | 452.70 | 237.817307 |
| 21 | 452.178998 | 416.80 | 218.114314 |
| 22 | 572.306952 | 620.50 | 242.775594 |
| 23 | 476.350549 | 471.10 | 211.118148 |
| 24 | 505.260976 | 454.50 | 214.630693 |
| 25 | 540.551163 | 606.40 | 271.928627 |
| 26 | 544.987273 | 572.70 | 178.086653 |
| 27 | 583.072222 | 589.85 | 214.616784 |
| 28 | 559.124324 | 491.70 | 257.784868 |
| 29 | 442.145000 | 336.20 | 169.228984 |
| 30 | 580.293103 | 594.30 | 165.961585 |
| 31 | 348.242857 | 219.60 | 172.555632 |
| 32 | 363.100000 | 363.10 | NaN |
| 33 | 338.200000 | 338.20 | NaN |
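A sketch of the aggregation, assuming `trimmed` is the IQR-trimmed frame from the univariate section (the author's exact outlier handling may differ):

```python
# Mean, median, and standard deviation of calories per ingredient count.
pivot = trimmed.pivot_table(
    index="n_ingredients",
    values="calories (#)",
    aggfunc=["mean", "median", "std"],
)
print(pivot.head())
```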
Assessment of Missingness
The review column in the dataset has missing data, and I believe the missingness is NMAR (not missing at random). Most of the time, people only take the time to write out and post a review if they have strong feelings about what they are reviewing and want to express them. Because of this, whether a review is missing depends on the (unobserved) value of the review itself, which is exactly what classifies the missingness as NMAR.
To determine the missingness for the ratings column, I ran a permutation test to assess if the missingness of the rating column was due to the minutes column.
Null hypothesis: The missingness of the rating does not depend on the number of minutes. Alternative hypothesis: The missingness of the rating depends on the number of minutes.
After running the test, I obtained a p-value of 0.036. At the 0.01 significance level I failed to reject the null hypothesis, and therefore determined that the missingness of the rating is not dependent on the minutes column.
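A sketch of this permutation test. It assumes the ratings filled with 0 are exactly the originally missing ones (the flag could equally be saved before the fill), and uses the absolute difference in mean minutes between the two groups as the statistic:

```python
import numpy as np

# Boolean mask: which ratings were originally missing (assumption: 0 marks them).
rating_missing = (merged["rating"] == 0).values

def mean_diff(mask):
    # Absolute difference in mean minutes between missing and non-missing rows.
    return abs(merged.loc[mask, "minutes"].mean() - merged.loc[~mask, "minutes"].mean())

observed = mean_diff(rating_missing)

# Shuffle the missingness labels to simulate the null distribution.
null_stats = [mean_diff(np.random.permutation(rating_missing)) for _ in range(1000)]
p_value = np.mean(np.array(null_stats) >= observed)
```

The test against calories works the same way, with "calories (#)" substituted for "minutes".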
I then ran a permutation test to assess if the missingness for the ratings column was due to the number of calories in the recipe.
Null hypothesis: The missingness of the rating does not depend on the number of calories in the recipe. Alternative hypothesis: The missingness of the rating depends on the number of calories in the recipe.
After running the permutation test, I obtained a p-value of 0.0, and therefore rejected the null hypothesis at the 0.01 significance level. I determined that the missingness of the rating column is dependent on the calories column, and therefore the missingness of the rating column is MAR (missing at random).
Hypothesis Testing
Next, I ran a permutation test to determine if there is a relationship between the length of a recipe and its average rating.
Null Hypothesis: There is no relationship between the time a recipe takes and its average rating. Alternative Hypothesis: Recipes that take over 37 minutes have a lower average rating than recipes that take 37 minutes or less.
In order to avoid bias, I removed the outliers from the dataframe using the IQR rule. I then created a new boolean column indicating whether a recipe takes 37 minutes or less. My test statistic was the difference in mean average rating between recipes taking 37 minutes or less and recipes taking longer. At the 0.01 significance level I chose, the p-value I found was 0.0, so I reject the null hypothesis and conclude that there is a relationship between the length of a recipe and its average rating. This test allows me to determine that recipe length and rating are not independent, and brings me closer to understanding their relationship.
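A sketch of this test, assuming `trimmed` is the IQR-trimmed frame from earlier:

```python
# Label each recipe row as short (<= 37 minutes) or long.
is_short = (trimmed["minutes"] <= 37).values
ratings = trimmed["avg_rating"].values

def rating_gap(mask):
    # Mean rating of short recipes minus mean rating of long recipes.
    return ratings[mask].mean() - ratings[~mask].mean()

observed = rating_gap(is_short)
null_stats = [rating_gap(np.random.permutation(is_short)) for _ in range(1000)]

# One-sided test: large positive gaps favor the alternative
# (short recipes rated higher than long ones).
p_value = np.mean(np.array(null_stats) >= observed)
```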
Framing a Prediction Problem
The prediction problem I will focus on is predicting the rating of a recipe. It is a multiclass classification problem, as the rating is on a 1-5 scale and can be treated as an ordinal categorical variable. I use a Random Forest Classifier, and I chose the response variable to be the average rating of the recipe, as I think it would be interesting to see how the different variables play a part in the rating a recipe receives; it could also help in understanding the thought process that goes into rating a recipe. The metric I am using to evaluate the model is the F1-score, as this metric takes into account the class imbalances of the dataset, allowing for a more balanced evaluation than other metrics provide. The information we would know at the time of prediction is what we obtain from the recipes on the website, including the ingredients, the steps, how many minutes a recipe takes, etc.
Baseline Model
My baseline model contains only 2 features: the number of steps and the number of ingredients that the recipe states. Both of these variables are numerical, and I encoded both using the standard scaler transformation, in order to measure them on equal scales and make sure that large ranges don’t cause bias. I used the F1-score to evaluate the model, and got a score of 0.638. I don’t think this model is very good, as it only uses 2 features, which don’t give a complete picture of the recipe, so I think a few other features are needed to make the model more complex and perform better. Also, the F1-score is a little low, so I think adding more features that are important to a recipe will help increase this score.
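A sketch of the baseline pipeline. Rounding avg_rating to the nearest whole star to get five classes is my assumption, as the report does not say how ratings were binned, and the exact score will not reproduce without the same split:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two numeric features and a rounded-rating target.
X = merged[["n_steps", "n_ingredients"]]
y = merged["avg_rating"].round().astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

baseline = Pipeline([
    ("prep", ColumnTransformer([("std", StandardScaler(), ["n_steps", "n_ingredients"])])),
    ("clf", RandomForestClassifier()),
])
baseline.fit(X_train, y_train)

# Weighted F1 accounts for the imbalance across rating classes.
print(f1_score(y_test, baseline.predict(X_test), average="weighted"))
```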
Final Model
The features I used in the final model are minutes, n_steps, n_ingredients, calories (#), and has_sugar. I chose to include the minutes and calories features, as these showed a relationship with the ratings and n_ingredients columns, meaning they would add complexity to the model. I applied the standard scaler transformation to these columns to measure them on the same scale and avoid bias. I also chose to include the has_sugar feature, which I believed would help my model, as sugar is something people pay attention to when choosing a recipe. As this is a categorical feature, I one-hot encoded it in order to use it in my model.
I used a Random Forest Classifier for this model, as it aggregates predictions from many different trees, which helps to reduce overfitting. In order to choose my hyperparameters, I used GridSearchCV from sklearn with 5-fold cross-validation to determine the best values for max_depth and n_estimators. After running this, I found that the best parameters were a max_depth of None and 200 for n_estimators. The F1-score that this model received was 0.914, a 0.276 increase over the baseline model. I think this model is better suited to the data than the baseline, as it takes into account more variables related to the dataset, giving the model a good overview from which to make a properly educated prediction.
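A sketch of the final pipeline and search. The grid values below are illustrative, since the report only names the winning pair (max_depth=None, n_estimators=200):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

features = ["minutes", "n_steps", "n_ingredients", "calories (#)", "has_sugar"]
X = merged[features]
y = merged["avg_rating"].round().astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

final_model = Pipeline([
    ("prep", ColumnTransformer([
        # Scale the numeric features; one-hot encode the boolean feature.
        ("std", StandardScaler(), ["minutes", "n_steps", "n_ingredients", "calories (#)"]),
        ("onehot", OneHotEncoder(drop="if_binary"), ["has_sugar"]),
    ])),
    ("clf", RandomForestClassifier()),
])

param_grid = {
    "clf__max_depth": [None, 5, 10, 20],       # illustrative values
    "clf__n_estimators": [50, 100, 200],       # illustrative values
}
search = GridSearchCV(final_model, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)

print(search.best_params_)
print(f1_score(y_test, search.predict(X_test), average="weighted"))
```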
Fairness Analysis
To evaluate the fairness of the model, I decided to determine if the precision of the model was roughly the same between shorter recipes (35 minutes or less) and longer recipes (over 35 minutes).
Null Hypothesis: The model is fair; the precision for shorter recipes is roughly the same as for longer recipes. Alternative Hypothesis: The model is unfair; the precision for shorter recipes is lower than for longer recipes.
My test statistic was the difference between the precision of the model on shorter recipes and on longer recipes (precision of shorter minus precision of longer). I chose a significance level of 0.01 and obtained a p-value of 0.88, so I fail to reject the null hypothesis and conclude that the model is fair between shorter and longer recipes.
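A sketch of this fairness check, assuming `search` and the test split from the final-model block above, and using weighted precision as in the rest of the evaluation:

```python
import numpy as np
from sklearn.metrics import precision_score

# Attach predictions and true labels to the test features.
test = X_test.assign(pred=search.predict(X_test), true=y_test.values)
is_short = (test["minutes"] <= 35).values

def precision_gap(mask):
    # Weighted precision on short recipes minus weighted precision on long ones.
    p_short = precision_score(test.loc[mask, "true"], test.loc[mask, "pred"],
                              average="weighted", zero_division=0)
    p_long = precision_score(test.loc[~mask, "true"], test.loc[~mask, "pred"],
                             average="weighted", zero_division=0)
    return p_short - p_long

observed = precision_gap(is_short)

# Shuffle the short/long labels to build the null distribution.
null_stats = [precision_gap(np.random.permutation(is_short)) for _ in range(500)]

# One-sided test: the alternative says short-recipe precision is lower (negative gap).
p_value = np.mean(np.array(null_stats) <= observed)
```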