Recipe-Analysis

Introduction

This project works with two CSV files from food.com: a recipe file and a rating file. Together they contain individual users' ratings of recipes, nutritional information about each recipe, and much more. The question this project investigates is: how does the number of steps in a recipe affect the nutrition of the recipe? The number of steps generally tracks how many actions are performed on the ingredients (like boiling or deep frying), so this question can help show how the number of steps relates to nutrition and whether cooking adds nutrition to foods or removes it. In particular, we analyze calories, carbohydrates, sugar, total fat, saturated fat, sodium, and protein; all but calories are expressed as a percentage of the recommended daily intake. The columns we work with are n_steps (the number of steps in a recipe) and nutrition, and there are 83,623 recipes in total to analyze.

Cleaning

The recipe CSV file contains each recipe's nutrition information, number of steps, and ingredients, and the rating CSV file contains reviews, each with a recipe id that connects the review to a recipe. We merge the two files on the recipe id, so that each review also carries information such as nutrition and ingredients. Ratings must be between 1 and 5 inclusive, but the data frame contains reviews with a rating of 0. It is reasonable to assume these come from people who didn't give a rating, so we replace every 0 with a null value, which represents no response and carries no weight in calculations such as the average rating of a recipe. Finally, the nutrition column in the original file stores all of the nutrition values as a single list, so we separate it into its individual nutrients and add a separate column for each. After dropping unnecessary columns we get the following dataframe:

	n_steps	avg_rating	calories	total fat	sugar	sodium	protein	saturated fat	carbohydrates	n_ingredients
0	15	4.750000	384.7	0.0	0.0	70.0	159.0	0.0	6.0	3
1	5	5.000000	304.1	13.0	121.0	3.0	13.0	9.0	19.0	3
2	5	4.777778	26.8	0.0	11.0	5.0	2.0	0.0	1.0	14
3	2	5.000000	40.7	0.0	17.0	2.0	4.0	0.0	2.0	11
4	4	5.000000	146.5	0.0	0.0	0.0	0.0	0.0	0.0	4
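
The cleaning steps described above can be sketched in pandas roughly as follows. This is a minimal sketch, not the project's actual code: the column names `id`, `recipe_id`, `rating`, and `nutrition` are assumptions about the raw files, the nutrition list is assumed to be stored as a string in the same order as the columns of the table above, and the function name `clean` is hypothetical.

```python
import ast

import numpy as np
import pandas as pd

# Assumed order of the values inside each nutrition list (taken from the
# column order of the cleaned table; not confirmed against the raw file).
NUTRIENTS = ["calories", "total fat", "sugar", "sodium",
             "protein", "saturated fat", "carbohydrates"]


def clean(recipes: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame:
    """Merge reviews onto recipes and split nutrition into columns."""
    # Left merge keeps recipes even if they have no reviews.
    merged = recipes.merge(ratings, left_on="id", right_on="recipe_id", how="left")

    # A rating of 0 is treated as "no rating given".
    merged["rating"] = merged["rating"].replace(0, np.nan)

    # Mean rating per recipe; missing ratings are skipped by the mean.
    merged["avg_rating"] = merged.groupby("id")["rating"].transform("mean")

    # Assumes each nutrition entry is a string such as "[384.7, 0.0, ...]".
    parsed = merged["nutrition"].apply(ast.literal_eval)
    nutrition = pd.DataFrame(parsed.tolist(), index=merged.index, columns=NUTRIENTS)
    merged = pd.concat([merged.drop(columns="nutrition"), nutrition], axis=1)

    # Which columns count as "unnecessary" is a guess based on the table above.
    return merged[["n_steps", "avg_rating", *NUTRIENTS, "n_ingredients"]]
```

Because the merge is done at the review level, the result has one row per review; the individual `rating` column would need to be kept in the selection if it is used later.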

Exploratory Data Analysis

To start, let's look at the distribution of the number of steps:

As we can see, the distribution of steps looks centered around 9 and appears roughly normal in shape. Hence we will be able to conduct a hypothesis test that splits recipes around 9 steps to look for a nutritional effect later on. Let's also look at the distribution of a specific nutrient (fat):

As we can see, fat has a large number of 0 values. This means that not all recipes contain fat, which might skew our distribution when we conduct a hypothesis test. Now let's look at the number of steps against carbohydrates.

This graph appears to follow a roughly linear trend, with a higher number of steps going along with a higher amount of carbohydrates. But the number of steps may not be the true reason for the higher nutrition: more steps might instead mean more ingredients, and the ingredients might be the real cause. We can check whether this is true by creating a pivot table with the number of steps as the index, the number of ingredients as the columns, and calories as the values.

n_ingredients	1	2	3	4	5	6	7	8
n_steps
1	NaN	213.295082	246.122078	226.167308	324.183992	223.804375	308.724324	313.586420
2	1851.75	224.256471	236.727032	247.509088	266.345312	287.740701	317.015464	318.145852
3	201.20	266.496460	251.027123	270.738571	277.039463	299.443945	320.241935	324.131434
4	284.80	391.448039	237.642314	385.776799	329.568616	312.687052	321.440529	340.693716
5	199.50	303.427848	359.144558	314.562428	346.674227	326.139122	360.033325	334.946556
6	NaN	356.192857	424.277632	365.591860	376.002450	374.634507	397.015855	361.519975
7	3116.20	407.926190	334.634911	320.568382	337.915513	359.516051	389.614938	391.944647
8	NaN	280.927907	317.607947	300.453793	333.636905	385.448466	389.633145	380.285597

As we move right along any row of the table, the calories don't necessarily follow a pattern, so the idea that the number of ingredients is the cause of the correlation is likely not correct. Hence we can continue to investigate how the number of steps alone affects nutrition.
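
The pivot table above could be built along these lines. This is a sketch under assumptions: the cells are assumed to be mean calories (the default aggregation of `pivot_table`), the restriction to values up to 8 only mirrors the portion of the table shown, and the function name and `max_value` parameter are hypothetical.

```python
import pandas as pd


def calories_by_steps_and_ingredients(df: pd.DataFrame, max_value: int = 8) -> pd.DataFrame:
    """Mean calories for each (n_steps, n_ingredients) combination."""
    # Keep only the corner of the grid shown in the table above.
    small = df[(df["n_steps"] <= max_value) & (df["n_ingredients"] <= max_value)]
    return small.pivot_table(
        index="n_steps",
        columns="n_ingredients",
        values="calories",
        aggfunc="mean",  # combinations with no recipes come out as NaN
    )
```

Reading across a row holds the number of steps fixed while the number of ingredients varies, which is the comparison used to check whether ingredients explain the trend.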

Missing Values

It is plausible that the rating column is “not missing at random” (NMAR). One factor that might affect this is how many times the person has used the recipe: someone who has made it only a few times is likely less invested in the recipe and may not want to give a rating, while someone who has made it many times is more likely to leave one. If we were given this data, we could possibly conclude that the missingness of rating depends on the number of times a person used the recipe. With the data we do have, we will investigate whether the missingness of rating depends on other columns. We plot the distribution of sodium separately for rows with a missing rating and rows with a non-missing rating, and get the following distribution:

True represents the rows whose rating is not missing and False represents the rows whose rating is missing. The distributions look fairly similar, so we hypothesize that the missingness is unlikely to depend on sodium. We use the Kolmogorov-Smirnov statistic, which measures the largest gap between the two empirical cumulative distributions, and a permutation test to see how much that statistic varies when missingness is assigned at random. The probability of seeing our result or anything more extreme is .17 (the p-value), which is large enough that, at a significance level of .05, we cannot conclude that missingness depends on sodium. However, if we run the same test with calories we find the following distribution:

Now the two distributions look fairly different. When we calculate the test statistic and compare it with the test statistics generated by permutation, we find the following empirical distribution.

The dotted line represents the observed test statistic for the dataframe. Since it is so far from the rest of the distribution, it is extremely unlikely that the difference is due to chance alone. We find p = 0.00, so at a significance level of .01 it is very likely that the missingness of rating depends on calories. Hence rating is likely missing at random (MAR) with respect to the calories column. All of this matters to us because we want to know how nutrition relates to the results in our dataframe as a whole, and especially to rating.
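
The missingness tests in this section can be sketched as a permutation test on the Kolmogorov-Smirnov statistic. This is an illustrative sketch, not the project's code: the function names, the `n_perm` and `seed` parameters, and the column name `rating` are assumptions, and the KS statistic is computed directly with NumPy rather than with a statistics library.

```python
import numpy as np
import pandas as pd


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Largest gap between the empirical CDFs of two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def missingness_permutation_test(df, col, target="rating", n_perm=1000, seed=0):
    """Test whether missingness of `target` depends on `col`.

    Returns the observed KS statistic and the permutation p-value.
    """
    rng = np.random.default_rng(seed)
    values = df[col].to_numpy(dtype=float)
    missing = df[target].isna().to_numpy()

    observed = ks_statistic(values[missing], values[~missing])

    # Shuffling the missing/non-missing labels simulates the null
    # hypothesis that missingness is unrelated to `col`.
    simulated = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(missing)
        simulated[i] = ks_statistic(values[shuffled], values[~shuffled])

    p_value = float(np.mean(simulated >= observed))
    return observed, p_value
```

Calling `missingness_permutation_test(df, "sodium")` and `missingness_permutation_test(df, "calories")` would correspond to the two tests described above. A reported p-value of 0.00 means that none of the simulated statistics reached the observed one.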

Hypothesis Test

Now we want to figure out how the number of steps in a recipe relates to its nutrition, in particular whether there is a significant difference in nutritional information between recipes with fewer than 9 steps and recipes with 9 or more steps. We use the following test statistic: (mean of calories for steps >= 9 - mean of calories for steps < 9)/(mean of calories) + (mean of fat for steps >= 9 - mean of fat for steps < 9)/(mean of fat), and so on for carbohydrates, sodium, saturated fat, protein, and sugar. Our null hypothesis is that the test statistic is equal to 0, and our alternative is that it is greater than 0. In other words, the null is that there is no significant difference in nutrition between recipes with steps < 9 and steps >= 9, and the alternative is that recipes with steps >= 9 have more nutrition than recipes with fewer than 9 steps. The test statistic divides each difference by the mean to standardize each term so that they all carry the same weight. Calculating the test statistic on our dataframe gives .257. We then use another permutation test, randomizing which data points are labeled as fewer than 9 steps and which as 9 or more, to get a distribution of these test statistics. Plotting these statistics gives the following:

Looking at this plot, there is a clear gap between our observed test statistic and the randomized test statistics from the permutation test. When we compute the probability of seeing a test statistic at least as large as ours, we find p = 0.00. Hence, at a significance level of .01, we reject the null hypothesis that there is no difference in nutrition between recipes with 9 or more steps and recipes with fewer than 9 steps. The data favor the alternative hypothesis that recipes with steps >= 9 have a higher nutrition content than recipes with fewer than 9 steps.
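
The test statistic and permutation test described in this section can be sketched as follows. This is a sketch under stated assumptions: the denominator of each term is taken to be the overall mean of that nutrient across all rows, the nutrient column names follow the cleaned table, and the function names and the `cutoff`, `n_perm`, and `seed` parameters are hypothetical.

```python
import numpy as np
import pandas as pd

NUTRIENTS = ["calories", "total fat", "sugar", "sodium",
             "protein", "saturated fat", "carbohydrates"]


def standardized_gap(values: np.ndarray, many_steps: np.ndarray) -> float:
    """Sum over nutrients of (mean for >= cutoff - mean for < cutoff) / overall mean.

    values: array of shape (n_rows, n_nutrients)
    many_steps: boolean mask, True where n_steps >= cutoff
    """
    diff = values[many_steps].mean(axis=0) - values[~many_steps].mean(axis=0)
    # Assumption: each difference is scaled by the overall mean of its nutrient.
    return float(np.sum(diff / values.mean(axis=0)))


def steps_permutation_test(df, cutoff=9, n_perm=1000, seed=0):
    """One-sided permutation test; returns (observed statistic, p-value)."""
    rng = np.random.default_rng(seed)
    values = df[NUTRIENTS].to_numpy(dtype=float)
    many_steps = (df["n_steps"] >= cutoff).to_numpy()

    observed = standardized_gap(values, many_steps)

    # Shuffling the group labels simulates the null hypothesis that
    # nutrition is unrelated to which side of the cutoff a recipe is on.
    simulated = np.array([
        standardized_gap(values, rng.permutation(many_steps))
        for _ in range(n_perm)
    ])

    # Alternative hypothesis: statistic greater than 0, so count the upper tail.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value
```

Dividing by the mean makes each term unitless, so calories (measured in absolute units) do not dominate the nutrients measured as percentages of daily intake.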