
This project uses R and Machine Learning to predict recipe popularity. An end-to-end analysis, from data cleaning to a Random Forest model, achieved 80.6% precision, surpassing the business goal. The outcome is a list of 536 high-traffic recipes and strategic insights for content and marketing, demonstrating how to drive business value with data.


Tasty Bytes: Predicting High-Traffic Recipes

1. Project Overview

Business Problem

The Product Manager for Recipe Discovery is tasked with selecting recipes to feature on the homepage to maximize user engagement. The process is currently subjective, leading to inconsistent traffic and potentially missed opportunities. The core business question is: "Can we use data science to reliably predict which recipes will be popular before we feature them?"

Business Goal

The primary objective, set by the Head of Data Science, was to develop a model that correctly predicts popular recipes at least 80% of the time (i.e., a Precision of 80%) while minimizing the chance of showing unpopular recipes. This data-driven approach aims to increase site traffic, improve user experience, and provide a strategic tool for the content team.


2. Data Validation and Cleaning

The foundation of any reliable model is clean, well-understood data. The process involved several critical steps to transform the raw data into a usable format.

Initial Data State

The initial dataset contained 947 recipes. An immediate check revealed several data quality issues that needed to be addressed, primarily a significant number of missing values (NAs).

    --- Checking for Missing Values (NA) in the Original Dataset ---

    [1] "NAs in 'calories': 52"

    [1] "NAs in 'carbohydrate': 52"

    [1] "NAs in 'sugar': 52"

    [1] "NAs in 'protein': 52"

    [1] "NAs in 'servings': 0"

    [1] "NAs in 'category': 0"

    [1] "NAs in 'high_traffic': 373"

Cleaning Process and Rationale

  1. Target Variable (high_traffic): The most significant issue was the 373 missing values in our target variable. A crucial business assumption was made: if a recipe's traffic isn't explicitly marked as 'High', it is considered 'Low'. This allowed us to convert the NA values to 'Low', transforming the problem into a binary classification task.

  2. Predictor Variables:

    • The servings column was cleaned of extraneous text (e.g., " as a snack") and converted to a numeric type.
    • The 52 rows with missing nutritional information (calories, protein, etc.) were removed to ensure the model would not be trained on incomplete data.
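The three cleaning steps can be sketched in base R on toy data. Column names follow the report; the toy values and the exact text-stripping rule are assumptions, since the report does not show the cleaning code:

```r
# Toy stand-in for the raw recipe table.
raw <- data.frame(
  servings     = c("4", "6 as a snack", "2"),
  calories     = c(120.5, NA, 300.2),
  high_traffic = c("High", NA, NA),
  stringsAsFactors = FALSE
)

# 1. Business assumption: traffic not explicitly marked 'High' is 'Low'
raw$high_traffic[is.na(raw$high_traffic)] <- "Low"

# 2. Strip extraneous text (e.g., " as a snack") and convert to numeric
raw$servings <- as.numeric(gsub("[^0-9.]", "", raw$servings))

# 3. Drop rows with missing nutritional values
clean <- raw[!is.na(raw$calories), ]
```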

Final Data State

After the complete cleaning process, we were left with a robust dataset of 895 recipes, with a class distribution of 535 'High' and 360 'Low' traffic recipes. This clean dataset formed the basis for all subsequent analysis and modeling.


3. Exploratory Data Analysis (EDA)

With a clean dataset, we performed an exploratory analysis to uncover initial patterns and insights.

Numerical Summary

A statistical summary provided a quantitative overview of our data's distribution, showing the range, mean, and quartiles for each numerical feature.

    Variable       vars   n     mean    sd      median  trimmed  mad     min   max      range    skew  kurtosis  se
    recipe*           1   895  448.00  258.51   448.00   448.00  332.10  1.00   895.00   894.00  0.00     -1.20   8.64
    calories          2   895  435.94  453.02   288.55   356.89  317.26  0.14  3633.16  3633.02  2.03      6.02  15.14
    carbohydrate      3   895   35.07   43.95    21.48    26.62   23.20  0.03   530.42   530.39  3.74     25.40   1.47
    sugar             4   895    9.05   14.68     4.55     5.79    4.95  0.01   148.75   148.74  4.21     24.17   0.49
    protein           5   895   24.15   36.37    10.80    16.35   14.14  0.00   363.36   363.36  3.50     18.04   1.22
    category*         6   895    5.77    3.23     6.00     5.73    4.45  1.00    11.00    10.00  0.10     -1.26   0.11
    servings          7   895    3.46    1.74     4.00     3.45    2.97  1.00     6.00     5.00  0.01     -1.19   0.06
    high_traffic*     8   895    1.40    0.49     1.00     1.38    0.00  1.00     2.00     1.00  0.40     -1.84   0.02

Visual Analysis

Visualizations helped to reveal relationships between variables.

  • Distribution of Calories: the histogram shows that most recipes are under 1000 calories, but there is a long tail of very high-calorie recipes, indicating potential outliers.

    Distribution of Calories

  • Recipe Categories: the bar chart shows that 'Breakfast' and 'Chicken Breast' are the most frequent categories in our dataset.

    Recipe Categories

  • Calories vs. Traffic: the boxplot indicates that while the median calories are similar for both high and low traffic recipes, high-traffic recipes have a much wider range and more high-calorie outliers. This suggests that a simple rule based on calories alone is not sufficient to predict popularity.

    Calories vs. Traffic
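The calories-vs-traffic comparison can be reproduced with base R graphics. This is a sketch on simulated data; `clean` stands in for the cleaned 895-row data frame, and the log scale is an assumption made here to tame the long right tail noted above:

```r
# Simulated stand-in for the cleaned data: log-normal calories, two classes.
set.seed(3)
clean <- data.frame(
  calories     = c(rlnorm(100, 5.5, 0.9), rlnorm(100, 5.5, 1.1)),
  high_traffic = factor(rep(c("Low", "High"), each = 100))
)

# Boxplot of calories by traffic class; log scale makes the outliers readable
boxplot(calories ~ high_traffic, data = clean, log = "y",
        xlab = "Traffic", ylab = "Calories")
```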


4. Model Development and Evaluation

Problem Formulation

The task is a supervised binary classification problem. We are training a model on a labeled dataset (features like calories, category; label is High/Low traffic) to predict the label for other recipes.

Model Selection Rationale

Two models were built to compare a simple approach against a more complex one.

  1. Baseline Model (Logistic Regression): chosen for its simplicity and interpretability. It attempts to find a linear relationship between the features and the outcome. As the EDA suggested, this was unlikely to be sufficient. The model performed poorly (Accuracy: 39.1%), confirming that the relationship between a recipe's characteristics and its popularity is not linear.

  2. Comparison Model (Random Forest): chosen for its high performance and ability to capture complex, non-linear relationships. A Random Forest operates like a "committee of experts" by building hundreds of individual decision trees and aggregating their votes. This "wisdom of the crowd" approach makes it highly effective.
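The two models can be fit as follows. This is a minimal sketch on toy data; the report's actual train/test split, feature set, and tuning settings are not given, so package defaults are assumed:

```r
library(randomForest)

set.seed(42)
# Toy stand-in for the cleaned recipe data (column names from the report).
n <- 200
toy <- data.frame(
  calories = runif(n, 50, 1500),
  protein  = runif(n, 0, 100),
  category = factor(sample(c("Potato", "Pork", "Beverages"), n, replace = TRUE))
)
toy$high_traffic <- factor(ifelse(toy$category == "Beverages", "Low",
                                  sample(c("High", "Low"), n, replace = TRUE,
                                         prob = c(0.8, 0.2))))

# 1. Baseline: logistic regression (a linear model of the log-odds)
logit_model <- glm(high_traffic ~ ., data = toy, family = binomial)

# 2. Comparison: a forest of 500 decision trees (the package default),
#    each voting High/Low; the majority vote is the prediction
rf_model <- randomForest(high_traffic ~ ., data = toy, ntree = 500)
```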

Model Evaluation

The Logistic Regression baseline performed poorly (Accuracy: 39.1%), confirming that a simple linear approach was insufficient.

                      Reference: Low   Reference: High
    Prediction: Low         66              103
    Prediction: High         6                4

The Random Forest proved far more effective at capturing the complex patterns in the data. The evaluation on the test set yielded the following results:

                      Reference: Low   Reference: High
    Prediction: Low         51               20
    Prediction: High        21               87

Key Performance Metrics

    Metric                       Value    Interpretation
    Accuracy                     77.1%    Overall, how often is the model correct?
    Precision (Pos Pred Value)   80.6%    Of recipes predicted 'High', how many were correct? (Our business KPI)
    Sensitivity (Recall)         81.3%    Of all actual 'High' recipes, how many did the model find?
    Specificity                  70.8%    Of all actual 'Low' recipes, how many did the model correctly identify?
    Kappa                        0.5226   Agreement beyond chance; values above 0.4 are considered moderate.
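Each headline metric follows directly from the four confusion-matrix counts above, treating 'High' as the positive class:

```r
# Confusion matrix counts from the Random Forest evaluation
tp <- 87  # predicted High, actually High
fp <- 21  # predicted High, actually Low
fn <- 20  # predicted Low,  actually High
tn <- 51  # predicted Low,  actually Low

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)   # Pos Pred Value -- the business KPI
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)

round(c(accuracy, precision, sensitivity, specificity) * 100, 1)
# 77.1 80.6 81.3 70.8
```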

The Business Metric: Precision

For this project, the most critical metric is Precision (Positive Predictive Value). It answers the question: "Of all the recipes our model predicted as 'High' traffic, what percentage were actually 'High' traffic?" This directly aligns with the business goal of minimizing the chance of showing unpopular recipes.

The model achieved a Precision of 80.6%, successfully exceeding the 80% business target.


5. Answering the Business Question & Strategic Insights

Tactical Answer: Which recipes will lead to high traffic?

The model was used to predict the traffic for all 895 recipes in the clean dataset. It identified a concrete list of 536 recipes with high potential for generating traffic. This list is the immediate, actionable answer to the Product Manager's request.
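The scoring step looks like the sketch below. A small logistic model stands in for the trained Random Forest here, since any fitted classifier is scored through `predict()` the same way; the data and variable names are hypothetical:

```r
set.seed(1)
# Toy stand-in for the 895-recipe cleaned table.
clean <- data.frame(
  calories = runif(100, 50, 1500),
  protein  = runif(100, 0, 100)
)
# Levels ordered so that glm models P(High), the event of interest
clean$high_traffic <- factor(
  ifelse(clean$protein + rnorm(100, 0, 20) > 40, "High", "Low"),
  levels = c("Low", "High")
)

model <- glm(high_traffic ~ calories + protein, data = clean, family = binomial)

# Score every recipe and keep those predicted 'High'
probs <- predict(model, newdata = clean, type = "response")
predicted_high <- clean[probs > 0.5, ]
nrow(predicted_high)  # the report's run produced a list of 536 recipes
```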

Strategic Answer 1: What kind of recipes should we create more of?

By analyzing the categories of the 536 recipes the model favored, we can extract strategic insights for future content development. The model showed a clear preference for certain types of recipes.

Top 5 Popular Categories

The top five categories predicted as popular are Potato, Vegetable, Pork, Lunch/Snacks, and Meat. This suggests that users engage most with savory, foundational dishes. This insight allows the content team to move from being reactive to proactive, focusing their efforts on creating content that is statistically likely to succeed.

Strategic Answer 2: Why are these recipes popular?

To understand the key drivers of popularity, we analyzed the feature importance from the Random Forest model.

Key Drivers of Recipe Popularity

The results are clear: category is the single most important predictor, nearly twice as influential as any nutritional component. This means that what a dish is (e.g., a Potato dish, a Pork dish) is far more predictive of its popularity than its specific nutritional profile.
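Feature importance is extracted straight from the fitted forest. A sketch on toy data, with hypothetical names; the report's chart likely corresponds to `varImpPlot()`:

```r
library(randomForest)

set.seed(7)
# Toy data mimicking the cleaned recipe table (column names from the report)
n <- 300
toy <- data.frame(
  category = factor(sample(c("Potato", "Pork", "Beverages"), n, replace = TRUE)),
  calories = runif(n, 50, 1500),
  protein  = runif(n, 0, 100)
)
toy$high_traffic <- factor(ifelse(toy$category == "Beverages", "Low",
                                  sample(c("High", "Low"), n, replace = TRUE,
                                         prob = c(0.8, 0.2))))

rf <- randomForest(high_traffic ~ ., data = toy)

# Mean decrease in Gini impurity per feature; higher = more important
importance(rf)
# varImpPlot(rf) draws the corresponding chart
```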

Deeper Insight: The Health-Conscious User Persona

The fact that the macronutrients (protein, calories, sugar, carbohydrate) form a tight cluster of secondary importance strongly suggests that our user base is nutritionally aware and likely motivated by health and fitness goals. While category drives initial interest, the nutritional profile is a significant secondary factor for our engaged users.


6. Final Recommendations

Based on the analysis, we propose the following data-driven recommendations:

  1. Implement (Tactical): immediately begin promoting the 536 recipes identified by the Random Forest model on the homepage to leverage their high potential for engagement.

  2. Strategize (Long-Term Content): direct the content creation team to focus on developing new recipes within the top-performing categories: Potato, Vegetable, Pork, Lunch/Snacks, and Meat.

  3. Operationalize (Process): integrate the trained Random Forest model into the editorial workflow. New recipes can be scored for their "popularity potential" before publication, aiding in promotion decisions from day one.

  4. Monitor (KPI): adopt Precision as the key performance indicator for this initiative. Monitor it weekly to ensure model performance remains high and retrain the model periodically as new data becomes available.

  5. Market (Growth): develop a segmented marketing strategy to capitalize on the identified health-conscious user segment. Launch targeted campaigns using fitness- and diet-related keywords (e.g., "high protein," "low carb") to attract this valuable audience directly to relevant recipes.
