
Recipe-Analysis

Introduction

This project works with two CSV files from food.com: a recipes file and a ratings file. Together they contain each individual's rating of a recipe, nutritional information about each recipe, and much more. The project addresses the following question: how does the number of steps in a recipe affect the nutrition of the recipe? The number of steps generally correlates with the number of actions performed on the ingredients (such as boiling or deep frying), so this question can help show how the number of steps relates to nutrition and whether cooking adds or removes nutrition from foods. In particular, we analyze carbohydrates, calories, sugar, total fat, saturated fat, sodium, and protein; all but calories are expressed as a percentage of the recommended daily intake. Our columns of interest are n_steps (the number of steps in a recipe) and the nutrition values, and there are 83,623 recipes in total to analyze.

Cleaning

The recipes CSV file contains the nutrition, number of steps, and ingredients of each recipe, and the ratings CSV file contains reviews, each with a recipe id that connects the review to a recipe. We can merge the two files on the recipe id, so that each review also carries information such as nutrition and ingredients. Ratings must be between 1 and 5 inclusive, but the data frame contains reviews with a rating of 0. It is reasonable to assume that these come from people who did not give a rating, so we replace every 0 with a null value; a null represents no response and carries no weight in calculations such as the average rating of a recipe. Finally, in the original file the nutrition column stores all of the nutrition values as a single list, so we separate it into one column per nutrient. After dropping unnecessary columns we get the following dataframe:
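The cleaning steps above could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the column names `id`, `recipe_id`, `rating`, and `nutrition`, the helper name `clean`, and the assumption that the nutrition list is stored as a string in the same order as the columns shown in the table are all assumptions made for illustration.

```python
import ast

import numpy as np
import pandas as pd

# Assumed order of the values inside the nutrition list (hypothetical).
NUTRITION_COLS = ["calories", "total fat", "sugar", "sodium",
                  "protein", "saturated fat", "carbohydrates"]


def clean(recipes: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Merge recipes with reviews and return one cleaned row per recipe."""
    # A left merge keeps recipes that have no reviews at all.
    merged = recipes.merge(reviews, left_on="id", right_on="recipe_id", how="left")
    # A rating of 0 is treated as "no rating given", i.e. missing.
    merged["rating"] = merged["rating"].replace(0, np.nan)
    # Average rating per recipe; mean() skips missing ratings.
    avg = merged.groupby("id")["rating"].mean().rename("avg_rating")
    out = recipes.set_index("id").join(avg)
    # Assumes each nutrition entry is a string such as "[384.7, 0.0, ...]".
    parsed = out["nutrition"].apply(ast.literal_eval)
    nutrition = pd.DataFrame(parsed.tolist(), index=out.index, columns=NUTRITION_COLS)
    out = pd.concat([out, nutrition], axis=1)
    return out[["n_steps", "avg_rating", *NUTRITION_COLS, "n_ingredients"]]
```

Replacing 0 with a missing value before averaging matters: leaving the zeros in would pull every affected recipe's average rating down.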

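| | n_steps | avg_rating | calories | total fat | sugar | sodium | protein | saturated fat | carbohydrates | n_ingredients |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 4.750000 | 384.7 | 0.0 | 0.0 | 70.0 | 159.0 | 0.0 | 6.0 | 3 |
| 1 | 5 | 5.000000 | 304.1 | 13.0 | 121.0 | 3.0 | 13.0 | 9.0 | 19.0 | 3 |
| 2 | 5 | 4.777778 | 26.8 | 0.0 | 11.0 | 5.0 | 2.0 | 0.0 | 1.0 | 14 |
| 3 | 2 | 5.000000 | 40.7 | 0.0 | 17.0 | 2.0 | 4.0 | 0.0 | 2.0 | 11 |
| 4 | 4 | 5.000000 | 146.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |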

Exploratory Data Analysis

To start, let's look at the distribution of the number of steps:

As we can see, the distribution of steps is centered around 9 and appears roughly bell-shaped. We will therefore use 9 steps as the cutoff in the hypothesis test later on, comparing nutrition for recipes below the cutoff with nutrition for recipes at or above it. Let's also look at the distribution of a specific nutrient (fat):

As we can see, fat has a large number of 0 values. This means that not all recipes contain fat, which might skew our distribution when we conduct a hypothesis test. Now let's look at the number of steps plotted against carbohydrates.

This graph appears to follow a roughly linear trend, with a higher number of steps corresponding to a higher amount of carbohydrates. But another question we can ask is whether the number of steps is the true reason for higher nutrition. Perhaps a higher number of steps simply goes with a higher number of ingredients, and the ingredients are the real cause. We can check this by creating a pivot table with n_steps as the index, n_ingredients as the columns, and average calories as the values.

| n_steps \ n_ingredients | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | NaN | 213.295082 | 246.122078 | 226.167308 | 324.183992 | 223.804375 | 308.724324 | 313.586420 |
| 2 | 1851.75 | 224.256471 | 236.727032 | 247.509088 | 266.345312 | 287.740701 | 317.015464 | 318.145852 |
| 3 | 201.20 | 266.496460 | 251.027123 | 270.738571 | 277.039463 | 299.443945 | 320.241935 | 324.131434 |
| 4 | 284.80 | 391.448039 | 237.642314 | 385.776799 | 329.568616 | 312.687052 | 321.440529 | 340.693716 |
| 5 | 199.50 | 303.427848 | 359.144558 | 314.562428 | 346.674227 | 326.139122 | 360.033325 | 334.946556 |
| 6 | NaN | 356.192857 | 424.277632 | 365.591860 | 376.002450 | 374.634507 | 397.015855 | 361.519975 |
| 7 | 3116.20 | 407.926190 | 334.634911 | 320.568382 | 337.915513 | 359.516051 | 389.614938 | 391.944647 |
| 8 | NaN | 280.927907 | 317.607947 | 300.453793 | 333.636905 | 385.448466 | 389.633145 | 380.285597 |

As we move right along the table, the calories do not follow a clear, consistent pattern within the rows, so the idea that the number of ingredients is the cause of the correlation is likely not correct. We can therefore continue to investigate how the number of steps alone affects nutrition.
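A pivot table like the one above can be built with pandas' `pivot_table`. The sketch below assumes the cleaned dataframe has `n_steps`, `n_ingredients`, and `calories` columns and that each cell is the mean of calories; the helper name `calories_pivot` and the toy data are made up for illustration.

```python
import pandas as pd


def calories_pivot(df: pd.DataFrame) -> pd.DataFrame:
    """Mean calories for every (n_steps, n_ingredients) combination."""
    return df.pivot_table(
        index="n_steps",          # one row per number of steps
        columns="n_ingredients",  # one column per number of ingredients
        values="calories",
        aggfunc="mean",           # combinations with no recipes become NaN
    )


# Tiny made-up dataframe, only to show the shape of the result.
toy = pd.DataFrame({
    "n_steps": [1, 1, 2, 2],
    "n_ingredients": [2, 2, 3, 2],
    "calories": [100.0, 200.0, 300.0, 400.0],
})
pivot = calories_pivot(toy)
```

Reading along a row holds the number of steps fixed while the number of ingredients changes, which is the comparison the paragraph above relies on.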

Missing Values

It is quite possible that the missingness of the rating column is "not missing at random" (NMAR). One factor that might affect it is how many times the person has used the recipe: someone who has used it only a few times is less invested in the recipe and less inclined to rate it, while someone who has used it many times is more likely to leave a review. If we had this data, we could possibly conclude that the missingness of rating is dependent on the number of times a person used the recipe. With the columns we do have, we can still investigate whether the missingness of rating depends on other columns. We plot the distribution of sodium for the rows where rating is missing and for the rows where it is not, which gives the following distribution:

True represents the rows whose rating is not missing and False the rows whose rating is missing. The two distributions look fairly similar, so it seems unlikely that missingness depends on sodium. To check, we use the Kolmogorov-Smirnov test statistic, which measures the largest gap between the two cumulative distributions, together with a permutation test that shows how much this statistic varies when the missingness labels are shuffled at random. The probability of seeing our observed statistic or anything more extreme is .17 (the p-value), which is large enough that at a significance level of .05 we cannot conclude that missingness depends on sodium. However, if we run the same test with calories we find the following distribution:

Now the two distributions look fairly different. When we calculate the test statistic and compare it with the test statistics generated by the permutations, we find the following empirical distribution.

The dotted line represents the test statistic of the observed data. Since it is so far away from the rest of the distribution, it is extremely unlikely that the difference is due to chance alone. We find p = 0.00, so at a significance level of .01 it is very likely that the missingness of rating is dependent on calories. Hence rating is likely missing at random (MAR) with respect to the calories column. All of this is important to us because we want to know how nutrition relates to the rest of our dataframe, and especially to rating.
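The missingness test described in this section can be sketched as below. This is an illustrative version under stated assumptions, not the project's code: it assumes a `rating` column with missing values, computes the Kolmogorov-Smirnov statistic directly with NumPy (`scipy.stats.ks_2samp` reports the same statistic), and the function names and the default of 1000 permutations are hypothetical.

```python
import numpy as np
import pandas as pd


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Largest absolute gap between the empirical CDFs of a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def missingness_permutation_test(df, col, n_perm=1000, seed=0):
    """Test whether missingness of `rating` depends on the column `col`."""
    rng = np.random.default_rng(seed)
    missing = df["rating"].isna().to_numpy()
    values = df[col].to_numpy()
    observed = ks_statistic(values[missing], values[~missing])
    sims = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling the labels breaks any link between missingness and col.
        shuffled = rng.permutation(missing)
        sims[i] = ks_statistic(values[shuffled], values[~shuffled])
    # p-value: share of shuffled statistics at least as extreme as observed.
    p_value = float(np.mean(sims >= observed))
    return observed, p_value
```

A call such as `missingness_permutation_test(df, "calories")` returns the observed statistic and an empirical p-value. A reported p of 0.00 only means that none of the shuffled statistics reached the observed one, so the true p-value is below roughly 1 / n_perm rather than exactly zero.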

Hypothesis Test

Now we want to determine how the number of steps in a recipe relates to nutrition, in particular whether there is a significant difference in nutritional information between recipes with fewer than 9 steps and recipes with 9 or more steps. We use the following test statistic: (mean calories for steps >= 9 - mean calories for steps < 9) / (mean calories) + (mean fat for steps >= 9 - mean fat for steps < 9) / (mean fat), and so on, with analogous terms for carbohydrates, sodium, saturated fat, protein, and sugar. Our null hypothesis is that the test statistic is equal to 0, and our alternative hypothesis is that it is greater than 0. In other words, the null is that there is no significant difference in nutrition between recipes with steps < 9 and recipes with steps >= 9, while the alternative is that recipes with steps >= 9 have more nutrition than recipes with fewer than 9 steps. The test statistic divides each term by the mean to standardize it, so that every nutrient carries the same weight. Calculating the test statistic on our dataframe gives .257. We then use another permutation test, randomly reassigning which data points fall in the "fewer than 9 steps" group and which fall in the "9 or more steps" group, to get a distribution of these test statistics. Plotting these statistics gives the following:

Looking at this plot, there is a clear gap between our observed test statistic and the randomized test statistics from the permutation test. When we compute the probability of seeing a test statistic at least as large as ours under the null, we find p = 0.00. Hence, at a significance level of .01, we reject the null hypothesis that there is no difference in nutrition between recipes with 9 or more steps and recipes with fewer than 9 steps. The data therefore favor the alternative hypothesis that recipes with steps >= 9 have higher nutrition content than recipes with fewer than 9 steps.
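The test statistic and permutation test from this section can be sketched as follows. It is a sketch under assumptions rather than the project's implementation: "mean of cal" is read as the overall mean of each nutrient across all recipes, the nutrient column names follow the cleaned dataframe shown earlier, and the function names and the default of 1000 permutations are made up for illustration.

```python
import numpy as np
import pandas as pd

NUTRIENTS = ["calories", "total fat", "sugar", "sodium",
             "protein", "saturated fat", "carbohydrates"]


def nutrition_statistic(df: pd.DataFrame, is_long: np.ndarray) -> float:
    """Sum over nutrients of (mean for long - mean for short) / overall mean."""
    long_means = df.loc[is_long, NUTRIENTS].mean()
    short_means = df.loc[~is_long, NUTRIENTS].mean()
    overall = df[NUTRIENTS].mean()  # assumed meaning of "mean of cal", etc.
    return float(((long_means - short_means) / overall).sum())


def steps_permutation_test(df, cutoff=9, n_perm=1000, seed=0):
    """One-sided permutation test: do recipes with >= cutoff steps have more?"""
    rng = np.random.default_rng(seed)
    is_long = (df["n_steps"] >= cutoff).to_numpy()
    observed = nutrition_statistic(df, is_long)
    # Shuffle group membership to simulate the null of no difference.
    sims = np.array([
        nutrition_statistic(df, rng.permutation(is_long))
        for _ in range(n_perm)
    ])
    # One-sided p-value: share of shuffled statistics >= the observed one.
    p_value = float(np.mean(sims >= observed))
    return observed, p_value
```

Dividing each difference by the overall mean keeps calories, which are on a much larger scale than the percentage-based nutrients, from dominating the sum.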