Recipes and Rating Research Project
This is Project 3 for the course DSC80
By Jiangqi Wu & Yuxuan Zhang
Introduction
The world of cooking and food has grown exponentially over the years, with a vast array of recipes and cuisines available at our fingertips. In this project, we aim to dive into an extensive dataset consisting of recipes and their respective ratings. Our goal is to uncover trends and patterns that can provide valuable insights into the factors contributing to the popularity and success of a recipe.
The dataset is sourced from food.com and contains recipes and reviews posted since 2008. The data is divided into two parts: recipes and ratings. The recipes dataset includes information such as recipe name, ID, preparation time, contributor ID, submission date, tags, nutrition information, number of steps, steps text, and description. The ratings dataset, on the other hand, contains user ID, recipe ID, date of interaction, rating, and review text.
The recipes dataframe has 83782 rows, one for each unique recipe. The ratings dataframe has 731927 rows, each representing one review of a recipe. For further research purposes, we add a column containing the average rating of each recipe, so we can analyze recipe features alongside their corresponding ratings.
In this project, we first clean the dataset and conduct exploratory data analysis to obtain basic information about the data and the relationships between columns. Then, we assess the missingness in the dataset through NMAR analysis and by analyzing missingness dependency.
In the missingness analysis, we examine the missingness of the description, review, and rating columns in the merged dataframe. In the NMAR analysis, we focus on the review column and offer a plausible explanation for why some reviews are missing. In the MCAR and MAR analysis, we run dependency tests to explore what the missingness of the rating column depends on; specifically, we test whether the missingness depends on the minutes and on the calories.
Moreover, we focus on the research question: are complex recipes and simple recipes rated on the same scale? We define recipes with 10 or fewer steps as simple recipes and recipes with more than 10 steps as complex recipes, and we analyze how ratings relate to recipe complexity.
This research question could be important for recipe designers and the owners of the food.com website. By answering it, we can offer insight into how users of the website rate recipes. Depending on whether people prefer complex recipes or not, recipe designers could create more complex or simpler recipes to meet users' needs.
Data Cleaning and EDA
Before conducting any analysis on the datasets, we first clean the data to make it more convenient to analyze.
Checking Data Types
First, we check the data type of each column and consider the necessary data cleaning steps.
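The snippet below is a minimal sketch of loading the data and inspecting the column types; the CSV file names are assumptions and may differ from the actual raw files.

```python
import pandas as pd

# Hypothetical file names for the raw food.com data.
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')

# Inspect the data type of each column in the recipes dataframe.
print(recipes.dtypes)
```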
```
name              object
id                 int64
minutes            int64
contributor_id     int64
submitted         object
tags              object
nutrition         object
n_steps            int64
steps             object
description       object
ingredients       object
n_ingredients      int64
dtype: object
```
Converting Object to List
The first cleaning step concerns the tags, steps, and ingredients columns. All three columns look like lists of strings, but checking individual entries shows that they are actually strings. This is likely because the text was never converted into lists when the data was scraped. We therefore convert these three columns into lists of strings, as sketched below.
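A minimal sketch of this conversion, assuming the recipes dataframe is named `recipes` as above:

```python
import ast

# Each entry is a string such as "['60-minutes-or-less', 'oven']";
# ast.literal_eval safely parses it back into a Python list of strings.
for col in ['tags', 'steps', 'ingredients']:
    recipes[col] = recipes[col].apply(ast.literal_eval)
```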
Converting `nutrition` Column to Lists and Assigning Individual Columns
We also find that the `nutrition` column looks like a list of floats but is actually a string. The values in each list represent: 'calories', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)', 'protein (PDV)', 'saturated fat (PDV)', and 'carbohydrates (PDV)'. We therefore convert the column into lists of floats and create an individual column for each nutrition value, casting each to the float data type for convenient calculation later.
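A sketch of this step, again assuming the `recipes` dataframe from above:

```python
import ast
import pandas as pd

nutrition_cols = [
    'calories', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)',
    'protein (PDV)', 'saturated fat (PDV)', 'carbohydrates (PDV)',
]

# Parse each nutrition string into a list of numbers, then spread the
# list into one float column per nutrition value.
parsed = recipes['nutrition'].apply(ast.literal_eval)
recipes[nutrition_cols] = pd.DataFrame(parsed.tolist(), index=recipes.index).astype(float)
```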
Changing `submitted` Column into Datetime
The `submitted` column is stored as an object. We convert it into a datetime data type so we can apply basic date operations later.
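A one-line sketch of the conversion:

```python
# Parse the submission date strings into a datetime64 column.
recipes['submitted'] = pd.to_datetime(recipes['submitted'])
```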
Merging Two Dataframes
The two dataframes share a common key: `id` in the recipes dataframe and `recipe_id` in the ratings dataframe. We merge them so that each recipe is shown together with its corresponding ratings and reviews, as sketched below.
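A sketch of the merge; keeping every recipe via a left merge is our assumption:

```python
# Left merge keeps every recipe, including recipes with no reviews yet.
merged = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')
```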
Adding Average Rating Column
After merging the two dataframes, one important quantity is the overall rating of each recipe. We therefore add a new column named `ave_rating`, which contains the average rating of each recipe. We also believe that a rating of 0 likely represents a rating the user did not fill in, so we replace 0 with NaN before averaging, as sketched below.
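A sketch of both steps, assuming the `merged` dataframe from above:

```python
import numpy as np

# Treat a rating of 0 as "not filled in" rather than a true score.
merged['rating'] = merged['rating'].replace(0, np.nan)

# Average rating per recipe, mapped back onto the recipes dataframe.
ave_rating = merged.groupby('id')['rating'].mean()
recipes['ave_rating'] = recipes['id'].map(ave_rating)
```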
Cleaning Result
The cleaned dataframe’s datatype for each column is listed below.
```
id                              int64
name                           object
minutes                         int64
contributor_id                  int64
submitted              datetime64[ns]
tags                           object
nutrition                      object
n_steps                         int64
steps                          object
description                    object
ingredients                    object
n_ingredients                   int64
ave_rating                    float64
calories                      float64
total fat (PDV)               float64
sugar (PDV)                   float64
sodium (PDV)                  float64
protein (PDV)                 float64
saturated fat (PDV)           float64
carbohydrates (PDV)           float64
dtype: object
```
The cleaned dataframe is shown below (with only important columns selected for display).
id | name | minutes | submitted | n_steps | n_ingredients | ave_rating |
---|---|---|---|---|---|---|
333281 | 1 brownies in the world best ever | 40 | 2008-10-27 00:00:00 | 10 | 9 | 4 |
453467 | 1 in canada chocolate chip cookies | 45 | 2011-04-11 00:00:00 | 12 | 11 | 5 |
306168 | 412 broccoli casserole | 40 | 2008-05-30 00:00:00 | 6 | 9 | 5 |
286009 | millionaire pound cake | 120 | 2008-02-12 00:00:00 | 7 | 7 | 5 |
475785 | 2000 meatloaf | 90 | 2012-03-06 00:00:00 | 17 | 13 | 5 |
The merged dataframe is shown below (with only important columns selected for display).
id | name | minutes | submitted | user_id | date | rating | review |
---|---|---|---|---|---|---|---|
333281 | 1 brownies in the world best ever | 40 | 2008-10-27 | 386585 | 2008-11-19 | 4 | These were pretty good, but took forever to bake. I would send it ended up being almost an hour! Even then, the brownies stuck to the foil, and were on the overly moist side and not easy to cut. They did taste quite rich, though! Made for My 3 Chefs. |
453467 | 1 in canada chocolate chip cookies | 45 | 2011-04-11 | 424680 | 2012-01-26 | 5 | Originally I was gonna cut the recipe in half (just the 2 of us here), but then we had a park-wide yard sale, & I made the whole batch & used them as enticements for potential buyers ~ what the hey, a free cookie as delicious as these are, definitely works its magic! Will be making these again, for sure! Thanks for posting the recipe! |
306168 | 412 broccoli casserole | 40 | 2008-05-30 | 29782 | 2008-12-31 | 5 | This was one of the best broccoli casseroles that I have ever made. I made my own chicken soup for this recipe. I was a bit worried about the tsp of soy sauce but it gave the casserole the best flavor. YUM! |
Exploratory Data Analysis
Univariate Analysis
In the univariate analysis, we analyze the distribution of the number of ingredients and the distribution of the number of steps.
This shows that the distribution can be approximated as a Gaussian distribution, but skewed right. The distribution is centered around 8, meaning that most recipes have about 8 ingredients.
The distribution of the number of steps shows a similar trend: a right-skewed, approximately Gaussian shape. Comparing the two plots, the distribution of the number of steps is more concentrated, with its center around 7, meaning that most recipes have about 7 steps. The plot also shows many outliers with very large step counts. After inspecting the dataset, considering the `minutes` column, and accounting for real-life plausibility, we decided to treat recipes with more than 40 steps or more than 200 minutes as outliers and unreliable data, and we filter them out as sketched below.
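A sketch of the filter:

```python
# Keep only recipes with plausible step counts and cooking times.
recipes = recipes[(recipes['n_steps'] <= 40) & (recipes['minutes'] <= 200)]
```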
Bivariate Analysis
Next, we conduct a bivariate analysis of the number of steps against the number of ingredients.
While the individual distributions of the number of steps and the number of ingredients look very similar, the scatter plot does not show a strong correlation between them. We would say there is a weak positive relationship between the number of steps and the number of ingredients.
We can also see that the average rating and the number of ingredients have little relationship with each other. In particular, for recipes with fewer than 15 ingredients, the plot is almost a horizontal line, showing no relationship between the two variables. The large fluctuations for recipes with more than 15 ingredients could be due to the relatively small amount of data collected in that range.
Interesting Aggregates
In the aggregate analysis, we study total fat against cooking minutes.
minutes | mean total fat (PDV) | median total fat (PDV) | min total fat (PDV) | max total fat (PDV) |
---|---|---|---|---|
0 | 46 | 46 | 46 | 46 |
1 | 7.78603 | 0 | 0 | 159 |
2 | 9.69053 | 0 | 0 | 419 |
3 | 12.5794 | 2 | 0 | 411 |
4 | 20.4719 | 7 | 0 | 258 |
This is the pivot table of total fat against minutes, built as sketched below. One interesting result in the aggregated data is a peak in total fat for recipes around 60 minutes of cooking time. Otherwise, the recipes' total fat fluctuates around 50 PDV, which is around 1000 calories. This suggests that most of the collected recipes are healthy food.
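A sketch of how such a pivot table can be built:

```python
# Aggregate total fat by cooking time, as in the table above.
fat_by_minutes = recipes.pivot_table(
    index='minutes',
    values='total fat (PDV)',
    aggfunc=['mean', 'median', 'min', 'max'],
)
```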
Assessment of Missingness
In this part, we conduct an assessment of missingness on the merged dataframe.
NMAR Analysis
In the NMAR analysis, we focus on the missingness of the review column in the merged dataframe. A review is probably missing because the reviewer thought the recipe was easy and there was nothing worth commenting on, so they simply skipped the review. If we wanted additional data that could explain this missingness (making it MAR), we could collect a personal difficulty evaluation from every person who uses the recipe.
Missingness Dependency
Now, we focus on the missingness of the rating column in the merged dataframe and test what this missingness depends on. We test whether the missingness depends on minutes (the time to finish the recipe) and on calories (the energy content of the recipe).
Minutes and Rating
Null hypothesis: the distribution of the minutes when the rating is missing is the same as the distribution of the minutes when the rating is not missing.

Alternative hypothesis: the distribution of the minutes when the rating is missing is different from the distribution of the minutes when the rating is not missing.

Test statistic: the absolute difference between the mean minutes of the two distributions. We also draw distribution plots of these two distributions.
We use a permutation test, shuffling the missingness of the rating 1000 times to obtain 1000 simulated values of the absolute difference.
Finally, we calculate a p-value of 0.127. Using 0.05 as the significance threshold, since 0.127 > 0.05, we fail to reject the null hypothesis that the missingness of the rating does not depend on the minutes. Based on this test result, the missingness of the rating looks MCAR with respect to minutes, because it is not correlated with the minutes.
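A sketch of the dependency test, assuming the `merged` dataframe from the cleaning step; the same function is reused below with `col='calories'`:

```python
import numpy as np

def missingness_perm_test(df, col, n_reps=1000):
    """p-value for whether the missingness of `rating` depends on `col`."""
    is_missing = df['rating'].isna().to_numpy()
    observed = abs(df.loc[is_missing, col].mean() - df.loc[~is_missing, col].mean())

    diffs = []
    for _ in range(n_reps):
        # Shuffle the missingness labels, breaking any real dependency.
        shuffled = np.random.permutation(is_missing)
        diffs.append(abs(df.loc[shuffled, col].mean() - df.loc[~shuffled, col].mean()))

    # Proportion of shuffled differences at least as extreme as observed.
    return (np.array(diffs) >= observed).mean()

p_minutes = missingness_perm_test(merged, 'minutes')
```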
Calories and Rating
Null hypothesis: the distribution of the calories when the rating is missing is the same as the distribution of the calories when the rating is not missing.

Alternative hypothesis: the distribution of the calories when the rating is missing is different from the distribution of the calories when the rating is not missing.

Test statistic: the absolute difference between the mean calories of the two distributions. We also draw distribution plots of these two distributions.
We run the same permutation test, shuffling the missingness of the rating 1000 times to obtain 1000 simulated values of the absolute difference (the sketch above with `col='calories'`).
Finally, we calculate a p-value of approximately 0.0. Using 0.05 as the significance threshold, since 0.0 <= 0.05, we reject the null hypothesis that the missingness of the rating does not depend on the calories. Based on this result, the missingness of the rating is MAR, because it depends on the calories. Recipes with relatively high or low calories may be more likely to have missing ratings.
Hypothesis Testing
The research question we investigate is: are simple recipes and complex recipes rated on the same scale?
In this part, we define a complex recipe as a recipe with more than 10 steps. We conduct a permutation test.
Setting Up the Test
Null Hypothesis H0: People rate all recipes on the same scale.
Alternative Hypothesis H1: People give complex recipes lower ratings.
We select only the useful columns, n_steps and ave_rating, and create a new column named `complex`, which is True if a recipe has more than 10 steps and False if it has 10 or fewer steps. (A sketch of this step follows the table below.)
id | n_steps | ave_rating | complex |
---|---|---|---|
333281 | 10 | 4 | False |
453467 | 12 | 5 | True |
306168 | 6 | 5 | False |
286009 | 7 | 5 | False |
475785 | 17 | 5 | True |
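A sketch of how this dataframe can be built from the cleaned `recipes` dataframe:

```python
# Keep only the columns needed for the test, dropping recipes without ratings.
df = recipes[['id', 'n_steps', 'ave_rating']].dropna(subset=['ave_rating']).copy()

# Label each recipe: complex if it has more than 10 steps.
df['complex'] = df['n_steps'] > 10
```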
The reason for choosing a one-sided test is that we assume people could feel frustrated when cooking complex recipes, and that recipes with more steps are harder to cook.
complex | mean n_steps | mean ave_rating |
---|---|---|
False | 6.35718 | 4.50184 |
True | 16.1414 | 4.48441 |
Since ave_rating is numerical data, it is appropriate to use the difference in group means as the test statistic. For this part of the research, the significance level we choose is 0.05.
The observed difference in means (simple minus complex) is 0.017428379224658563.
Permutation Test
We ran the permutation test 10000 times; the graph shows the distribution of the permutation test results, with the red line marking the observed value. A sketch of the test follows.
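A sketch of the test, assuming the `df` dataframe with the `complex` column from above:

```python
import numpy as np

# Observed statistic: mean rating of simple recipes minus mean rating of
# complex recipes (positive if complex recipes are rated lower).
group_means = df.groupby('complex')['ave_rating'].mean()
observed = group_means.loc[False] - group_means.loc[True]

diffs = []
for _ in range(10_000):
    # Shuffle the complexity labels, breaking any real association.
    shuffled = np.random.permutation(df['complex'].to_numpy())
    means = df['ave_rating'].groupby(shuffled).mean()
    diffs.append(means.loc[False] - means.loc[True])

# One-sided p-value: how often the shuffled difference is at least as large.
p_value = (np.array(diffs) >= observed).mean()
```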
Hypothesis Testing Conclusion
The p-value for the test is 0.009, which means that at a significance level of 0.05, we are able to reject the null hypothesis.
This result is plausible. First, the high complexity of a recipe could mean difficulty in cooking the dish and a higher probability of failing; if people fail to cook the dish, they might give the recipe a low rating. Also, people might have higher expectations of a dish that is complex and hard to make, and therefore apply a stricter rating scale to complex recipes.