
This project uses R and Machine Learning to predict recipe popularity. An end-to-end analysis, from data cleaning to a Random Forest model, achieved 80.6% precision, surpassing the business goal. The outcome is a list of 536 high-traffic recipes and strategic insights for content and marketing, demonstrating how to drive business value with data.


Tasty Bytes: Predicting High-Traffic Recipes

1. Project Overview

Business Problem

The Product Manager for Recipe Discovery is tasked with selecting recipes to feature on the homepage to maximize user engagement. The process is currently subjective, leading to inconsistent traffic and potentially missed opportunities. The core business question is: "Can we use data science to reliably predict which recipes will be popular before we feature them?"

Business Goal

The primary objective, set by the Head of Data Science, was to develop a model that correctly predicts popular recipes at least 80% of the time (i.e., a Precision of 80%) while minimizing the chance of showing unpopular recipes. This data-driven approach aims to increase site traffic, improve user experience, and provide a strategic tool for the content team.


2. Data Validation and Cleaning

The foundation of any reliable model is clean, well-understood data. The process involved several critical steps to transform the raw data into a usable format.

Initial Data State

The initial dataset contained 947 recipes. An immediate check revealed several data quality issues that needed to be addressed, primarily a significant number of missing values (NAs).

    --- Checking for Missing Values (NA) in the Original Dataset ---

    [1] "NAs in 'calories': 52"

    [1] "NAs in 'carbohydrate': 52"

    [1] "NAs in 'sugar': 52"

    [1] "NAs in 'protein': 52"

    [1] "NAs in 'servings': 0"

    [1] "NAs in 'category': 0"

    [1] "NAs in 'high_traffic': 373"

Cleaning Process and Rationale

  1. Target Variable (high_traffic): The most significant issue was the 373 missing values in our target variable. A crucial business assumption was made: if a recipe's traffic isn't explicitly marked as 'High', it is considered 'Low'. This allowed us to convert the NA values to 'Low', transforming the problem into a binary classification task.

  2. Predictor Variables:

    • The servings column was cleaned of extraneous text (e.g., " as a snack") and converted to a numeric type.
    • The 52 rows with missing nutritional information (calories, protein, etc.) were removed to ensure the model would not be trained on incomplete data.
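The three cleaning steps can be sketched in base R on toy data. Column names follow the report; the toy values and the exact text-stripping rule are assumptions, since the report does not show the cleaning code:

```r
# Toy stand-in for the raw recipe table.
raw <- data.frame(
  servings     = c("4", "6 as a snack", "2"),
  calories     = c(120.5, NA, 300.2),
  high_traffic = c("High", NA, NA),
  stringsAsFactors = FALSE
)

# 1. Business assumption: traffic not explicitly marked 'High' is 'Low'
raw$high_traffic[is.na(raw$high_traffic)] <- "Low"

# 2. Strip extraneous text (e.g., " as a snack") and convert to numeric
raw$servings <- as.numeric(gsub("[^0-9.]", "", raw$servings))

# 3. Drop rows with missing nutritional values
clean <- raw[!is.na(raw$calories), ]
```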

Final Data State

After the complete cleaning process, we were left with a robust dataset of 895 recipes, with a class distribution of 535 'High' and 360 'Low' traffic recipes. This clean dataset formed the basis for all subsequent analysis and modeling.


3. Exploratory Data Analysis (EDA)

With a clean dataset, we performed an exploratory analysis to uncover initial patterns and insights.

Numerical Summary

A statistical summary provided a quantitative overview of our data's distribution, showing the range, mean, and quartiles for each numerical feature.

    Variable       vars   n     mean    sd      median  trimmed  mad     min   max      range    skew  kurtosis  se
    recipe*           1   895  448.00  258.51   448.00   448.00  332.10  1.00   895.00   894.00  0.00     -1.20   8.64
    calories          2   895  435.94  453.02   288.55   356.89  317.26  0.14  3633.16  3633.02  2.03      6.02  15.14
    carbohydrate      3   895   35.07   43.95    21.48    26.62   23.20  0.03   530.42   530.39  3.74     25.40   1.47
    sugar             4   895    9.05   14.68     4.55     5.79    4.95  0.01   148.75   148.74  4.21     24.17   0.49
    protein           5   895   24.15   36.37    10.80    16.35   14.14  0.00   363.36   363.36  3.50     18.04   1.22
    category*         6   895    5.77    3.23     6.00     5.73    4.45  1.00    11.00    10.00  0.10     -1.26   0.11
    servings          7   895    3.46    1.74     4.00     3.45    2.97  1.00     6.00     5.00  0.01     -1.19   0.06
    high_traffic*     8   895    1.40    0.49     1.00     1.38    0.00  1.00     2.00     1.00  0.40     -1.84   0.02

Visual Analysis

Visualizations helped to reveal relationships between variables.

  • Distribution of Calories: the histogram shows that most recipes are under 1000 calories, but there is a long tail of very high-calorie recipes, indicating potential outliers.

    Distribution of Calories

  • Recipe Categories: the bar chart shows that 'Breakfast' and 'Chicken Breast' are the most frequent categories in our dataset.

    Recipe Categories

  • Calories vs. Traffic: the boxplot indicates that while the median calories are similar for both high and low traffic recipes, high-traffic recipes have a much wider range and more high-calorie outliers. This suggests that a simple rule based on calories alone is not sufficient to predict popularity.

    Calories vs. Traffic
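The calories-vs-traffic comparison can be reproduced with base R graphics. This is a sketch on simulated data; `clean` stands in for the cleaned 895-row data frame, and the log scale is an assumption made here to tame the long right tail noted above:

```r
# Simulated stand-in for the cleaned data: log-normal calories, two classes.
set.seed(3)
clean <- data.frame(
  calories     = c(rlnorm(100, 5.5, 0.9), rlnorm(100, 5.5, 1.1)),
  high_traffic = factor(rep(c("Low", "High"), each = 100))
)

# Boxplot of calories by traffic class; log scale makes the outliers readable
boxplot(calories ~ high_traffic, data = clean, log = "y",
        xlab = "Traffic", ylab = "Calories")
```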


4. Model Development and Evaluation

Problem Formulation

The task is a supervised binary classification problem. We are training a model on a labeled dataset (features like calories, category; label is High/Low traffic) to predict the label for other recipes.

Model Selection Rationale

Two models were built to compare a simple approach against a more complex one.

  1. Baseline Model (Logistic Regression): chosen for its simplicity and interpretability. It attempts to find a linear relationship between the features and the outcome. As the EDA suggested, this was unlikely to be sufficient. The model performed poorly (Accuracy: 39.1%), confirming that the relationship between a recipe's characteristics and its popularity is not linear.

  2. Comparison Model (Random Forest): chosen for its high performance and ability to capture complex, non-linear relationships. A Random Forest operates like a "committee of experts" by building hundreds of individual decision trees and aggregating their votes. This "wisdom of the crowd" approach makes it highly effective.
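The two models can be fit as follows. This is a minimal sketch on toy data; the report's actual train/test split, feature set, and tuning settings are not given, so package defaults are assumed:

```r
library(randomForest)

set.seed(42)
# Toy stand-in for the cleaned recipe data (column names from the report).
n <- 200
toy <- data.frame(
  calories = runif(n, 50, 1500),
  protein  = runif(n, 0, 100),
  category = factor(sample(c("Potato", "Pork", "Beverages"), n, replace = TRUE))
)
toy$high_traffic <- factor(ifelse(toy$category == "Beverages", "Low",
                                  sample(c("High", "Low"), n, replace = TRUE,
                                         prob = c(0.8, 0.2))))

# 1. Baseline: logistic regression (a linear model of the log-odds)
logit_model <- glm(high_traffic ~ ., data = toy, family = binomial)

# 2. Comparison: a forest of 500 decision trees (the package default),
#    each voting High/Low; the majority vote is the prediction
rf_model <- randomForest(high_traffic ~ ., data = toy, ntree = 500)
```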

Model Evaluation

The Logistic Regression baseline performed poorly (Accuracy: 39.1%), confirming that a simple linear approach was insufficient.

                      Reference: Low   Reference: High
    Prediction: Low         66              103
    Prediction: High         6                4

The Random Forest proved far more effective at capturing the complex patterns in the data. The evaluation on the test set yielded the following results:

                      Reference: Low   Reference: High
    Prediction: Low         51               20
    Prediction: High        21               87

Key Performance Metrics

    Metric                       Value    Interpretation
    Accuracy                     77.1%    Overall, how often is the model correct?
    Precision (Pos Pred Value)   80.6%    Of recipes predicted 'High', how many were correct? (Our business KPI)
    Sensitivity (Recall)         81.3%    Of all actual 'High' recipes, how many did the model find?
    Specificity                  70.8%    Of all actual 'Low' recipes, how many did the model correctly identify?
    Kappa                        0.5226   Agreement beyond chance; values above 0.4 are considered moderate.
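Each headline metric follows directly from the four confusion-matrix counts above, treating 'High' as the positive class:

```r
# Confusion matrix counts from the Random Forest evaluation
tp <- 87  # predicted High, actually High
fp <- 21  # predicted High, actually Low
fn <- 20  # predicted Low,  actually High
tn <- 51  # predicted Low,  actually Low

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)   # Pos Pred Value -- the business KPI
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)

round(c(accuracy, precision, sensitivity, specificity) * 100, 1)
# 77.1 80.6 81.3 70.8
```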

The Business Metric: Precision

For this project, the most critical metric is Precision (Positive Predictive Value). It answers the question: "Of all the recipes our model predicted as 'High' traffic, what percentage were actually 'High' traffic?" This directly aligns with the business goal of minimizing the chance of showing unpopular recipes.

The model achieved a Precision of 80.6%, successfully exceeding the 80% business target.


5. Answering the Business Question & Strategic Insights

Tactical Answer: Which recipes will lead to high traffic?

The model was used to predict the traffic for all 895 recipes in the clean dataset. It identified a concrete list of 536 recipes with high potential for generating traffic. This list is the immediate, actionable answer to the Product Manager's request.
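The scoring step looks like the sketch below. A small logistic model stands in for the trained Random Forest here, since any fitted classifier is scored through `predict()` the same way; the data and variable names are hypothetical:

```r
set.seed(1)
# Toy stand-in for the 895-recipe cleaned table.
clean <- data.frame(
  calories = runif(100, 50, 1500),
  protein  = runif(100, 0, 100)
)
# Levels ordered so that glm models P(High), the event of interest
clean$high_traffic <- factor(
  ifelse(clean$protein + rnorm(100, 0, 20) > 40, "High", "Low"),
  levels = c("Low", "High")
)

model <- glm(high_traffic ~ calories + protein, data = clean, family = binomial)

# Score every recipe and keep those predicted 'High'
probs <- predict(model, newdata = clean, type = "response")
predicted_high <- clean[probs > 0.5, ]
nrow(predicted_high)  # the report's run produced a list of 536 recipes
```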

Strategic Answer 1: What kind of recipes should we create more of?

By analyzing the categories of the 536 recipes the model favored, we can extract strategic insights for future content development. The model showed a clear preference for certain types of recipes.

Top 5 Popular Categories

The top five categories predicted as popular are Potato, Vegetable, Pork, Lunch/Snacks, and Meat. This suggests that users engage most with savory, foundational dishes. This insight allows the content team to move from being reactive to proactive, focusing their efforts on creating content that is statistically likely to succeed.

Strategic Answer 2: Why are these recipes popular?

To understand the key drivers of popularity, we analyzed the feature importance from the Random Forest model.

Key Drivers of Recipe Popularity

The results are clear: category is the single most important predictor, nearly twice as influential as any nutritional component. This means that what a dish is (e.g., a Potato dish, a Pork dish) is far more predictive of its popularity than its specific nutritional profile.
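Feature importance is extracted straight from the fitted forest. A sketch on toy data, with hypothetical names; the report's chart likely corresponds to `varImpPlot()`:

```r
library(randomForest)

set.seed(7)
# Toy data mimicking the cleaned recipe table (column names from the report)
n <- 300
toy <- data.frame(
  category = factor(sample(c("Potato", "Pork", "Beverages"), n, replace = TRUE)),
  calories = runif(n, 50, 1500),
  protein  = runif(n, 0, 100)
)
toy$high_traffic <- factor(ifelse(toy$category == "Beverages", "Low",
                                  sample(c("High", "Low"), n, replace = TRUE,
                                         prob = c(0.8, 0.2))))

rf <- randomForest(high_traffic ~ ., data = toy)

# Mean decrease in Gini impurity per feature; higher = more important
importance(rf)
# varImpPlot(rf) draws the corresponding chart
```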

Deeper Insight: The Health-Conscious User Persona

The fact that the macronutrients (protein, calories, sugar, carbohydrate) form a tight cluster of secondary importance strongly suggests that our user base is nutritionally aware and likely motivated by health and fitness goals. While category drives initial interest, the nutritional profile is a significant secondary factor for our engaged users.


6. Final Recommendations

Based on the analysis, we propose the following data-driven recommendations:

  1. Implement (Tactical): immediately begin promoting the 536 recipes identified by the Random Forest model on the homepage to leverage their high potential for engagement.

  2. Strategize (Long-Term Content): direct the content creation team to focus on developing new recipes within the top-performing categories: Potato, Vegetable, Pork, Lunch/Snacks, and Meat.

  3. Operationalize (Process): integrate the trained Random Forest model into the editorial workflow. New recipes can be scored for their "popularity potential" before publication, aiding in promotion decisions from day one.

  4. Monitor (KPI): adopt Precision as the key performance indicator for this initiative. Monitor it weekly to ensure model performance remains high and retrain the model periodically as new data becomes available.

  5. Market (Growth): develop a segmented marketing strategy to capitalize on the identified health-conscious user segment. Launch targeted campaigns using fitness- and diet-related keywords (e.g., "high protein," "low carb") to attract this valuable audience directly to relevant recipes.
