Berkeley Professional Certificate in Machine Learning and Artificial Intelligence
Capstone — Module 20.1: Initial Report and EDA
Can we predict which developers will experience actual productivity gains from AI coding assistants based on observable characteristics, and what factors explain the significant gap between perceived and measured productivity improvements?
Recent research reveals a striking paradox: in a randomized controlled trial, experienced developers using AI tools completed tasks 19% slower while believing they were 20% faster, a roughly 39-percentage-point perception-reality gap. Meanwhile, industry surveys report roughly 90% adoption, with over 80% of developers perceiving productivity increases. This project will use machine learning to identify the developer profiles, organizational contexts, and usage patterns that predict genuine productivity outcomes, helping organizations optimize their AI tool investments and adoption strategies.
| Data Source | Key Variables | Purpose |
|---|---|---|
| Stack Overflow Developer Survey 2025 (~49,000 respondents) https://www.kaggle.com/datasets/edoardogalli/stack-overflow-annual-developer-survey-2025 | AI tools used, perceived productivity, trust ratings, frustration indicators, experience level, languages, job role, company size | Primary dataset for modeling perception vs. outcome proxies |
| METR 2025 RCT (246 tasks, 16 developers) | Predicted vs. actual completion times, AI treatment assignment, task familiarity | Ground truth for the perception-reality gap |
A regularized Logistic Regression will serve as the baseline classifier:
- Classify developers into "likely to benefit" vs. "unlikely to benefit" categories based on survey-reported outcomes and observable characteristics
- Use L2 regularization to handle the high-dimensional feature space and identify the most predictive factors
- Interpretable coefficients support clear business recommendations
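A minimal sketch of this baseline, assuming scikit-learn; the feature matrix, column count, and target rule here are synthetic stand-ins, not the actual survey encoding:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))  # stand-in for the encoded survey features
# imbalanced synthetic target: only a minority of developers "benefit"
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# L2 penalty shrinks coefficients across a high-dimensional encoding;
# class_weight="balanced" reweights the minority "benefit" class
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, class_weight="balanced"),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

The pipeline standardizes features first so the L2 penalty treats all coefficients on a comparable scale.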
Tree ensembles (Random Forest, Gradient Boosting) will be evaluated as non-linear alternatives:
- Feature importance analysis to rank which developer and organizational characteristics most strongly predict the perception-reality gap
- Handle non-linear relationships and interactions between variables (e.g., experience level × codebase complexity)
- Compare performance against the regularized logistic regression baseline
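A small illustration of why the ensemble comparison matters, using a synthetic interaction between hypothetical `experience` and `complexity` variables (not the real survey columns): a Random Forest can capture an interaction that a linear baseline misses, and its importances down-rank a pure-noise feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 800
experience = rng.normal(size=n)   # hypothetical stand-in features
complexity = rng.normal(size=n)
noise_feat = rng.normal(size=n)   # pure noise, should rank last in importance
X = np.column_stack([experience, complexity, noise_feat])
# target driven by an interaction (experience x complexity):
# invisible to a linear model, learnable by trees
y = (experience * complexity > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
lr = LogisticRegression(class_weight="balanced")
rf_f1 = cross_val_score(rf, X, y, cv=5, scoring="f1").mean()
lr_f1 = cross_val_score(lr, X, y, cv=5, scoring="f1").mean()

rf.fit(X, y)
importances = rf.feature_importances_  # ranks the predictive features
```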
K-Means clustering will complement the supervised models with unsupervised segmentation:
- Segment developers into distinct personas based on AI usage patterns, experience profiles, and reported outcomes
- Identify natural groupings that may reveal "AI power users" vs. "AI-slowed developers" vs. "neutral" archetypes
- Support targeted adoption recommendations for different developer segments
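A sketch of the segmentation step, assuming scikit-learn; the three synthetic blobs stand in for the hypothesized personas and are not derived from the survey:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# three synthetic blobs on (usage intensity, reported benefit) axes,
# standing in for "power user", "AI-slowed", and "neutral" personas
power = rng.normal(loc=[2.0, 2.0], scale=0.4, size=(100, 2))
slowed = rng.normal(loc=[2.0, -1.5], scale=0.4, size=(100, 2))
neutral = rng.normal(loc=[-1.0, 0.0], scale=0.4, size=(100, 2))
X = np.vstack([power, slowed, neutral])

# scale before K-Means so both axes contribute equally to distances
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = km.labels_
```

In practice the number of clusters would be chosen with an elbow or silhouette analysis rather than fixed at three.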
The analysis will produce actionable recommendations for technology leaders:
- Targeting: Which developer profiles should be prioritized for AI tool licenses?
- ROI Estimation: What productivity impact should organizations realistically expect by developer segment?
- Adoption Strategy: What organizational factors (training, platform maturity, team practices) amplify or diminish AI tool effectiveness?
- Risk Mitigation: Which contexts show negative productivity impact, suggesting caution before widespread rollout?
METR RCT analysis shows developers consistently overestimate AI benefits: the average developer predicted a ~24% speedup but was actually slowed down across most task types and familiarity levels.
The Stack Overflow 2025 survey paints a complementary picture:
- ~49% of active developers use AI tools daily
- Only ~33% express trust in AI output accuracy
- Senior developers (20+ years) show higher rates of non-adoption and unfavorable sentiment
Logistic Regression with L2 regularization and balanced class weights:
| Metric | Score |
|---|---|
| F1-Score (test) | 0.508 |
| ROC-AUC (test) | 0.858 |
| 5-Fold CV F1 | 0.520 ± 0.007 |
The model achieves 84% recall on the positive class (genuine AI impact) with 36% precision, indicating good discrimination but room for improvement in reducing false positives.
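One plausible way to trade some recall for precision (a sketch on synthetic data, not the notebook's actual code) is to move the decision threshold using the precision-recall curve instead of the default 0.5 cutoff:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 4))
# ~25% positives: an imbalanced stand-in for "genuine AI impact"
y = (X[:, 0] + rng.normal(scale=1.5, size=n) > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, probs)
# pick the lowest threshold reaching at least 60% precision,
# accepting whatever recall remains at that operating point
target = 0.60
ok = np.where(precision[:-1] >= target)[0]
chosen = thresholds[ok[0]]
tuned_precision = precision_score(y_te, (probs >= chosen).astype(int))
```

A held-out validation split (not the test set) should be used to pick the threshold in a real analysis.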
| Feature | Coefficient | Direction |
|---|---|---|
| AI Sentiment (encoded) | +1.30 | Favorable sentiment predicts impact |
| Daily AI Usage | +0.82 | Frequent use predicts impact |
| AI Trust (encoded) | +0.36 | Higher trust predicts impact |
| Uses Cursor IDE | +0.22 | AI-native tooling predicts impact |
| Lost Confidence | +0.19 | Surprising positive association |
| Years Coding | -0.17 | More experience slightly reduces predicted impact |
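A table like the one above can be produced directly from a fitted model's coefficients. This sketch uses hypothetical feature names and synthetic data whose true effect signs merely mirror the table, not the actual encoded survey features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
# hypothetical feature names for illustration only
feature_names = ["ai_sentiment", "daily_ai_usage", "ai_trust", "years_coding"]
X = rng.normal(size=(500, 4))
# synthetic target: three positive effects, one negative, as in the table
y = (1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.4 * X[:, 2] - 0.6 * X[:, 3]
     + rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression(penalty="l2", class_weight="balanced").fit(X, y)
coef_table = (
    pd.DataFrame({"feature": feature_names, "coefficient": clf.coef_[0]})
    .assign(direction=lambda d: np.where(
        d["coefficient"] > 0, "predicts impact", "reduces predicted impact"))
    .sort_values("coefficient", key=np.abs, ascending=False)
    .reset_index(drop=True)
)
```

Sorting by absolute value keeps the strongest effects, positive or negative, at the top of the table.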
The full analysis is in capstone_eda_baseline.ipynb, containing:
- Data loading, inspection, and cleaning (27,852 active developers retained)
- Feature engineering (58 encoded features + 10 model features)
- 14 visualizations across METR RCT and SO datasets
- Logistic Regression baseline with cross-validation
- Random Forest and Gradient Boosting for comparison
- K-Means clustering for developer persona segmentation
- Hyperparameter tuning with GridSearchCV
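As a sketch of the tuning step, assuming scikit-learn's GridSearchCV over the regularization strength `C`; the data and grid here are illustrative, not the notebook's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(600, 6))
y = (X[:, 0] - X[:, 1] + rng.normal(size=600) > 0).astype(int)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # inverse regularization strength
search = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
    param_grid,
    scoring="f1",  # match the headline metric reported for the baseline
    cv=5,
)
search.fit(X, y)
best_C = search.best_params_["C"]
best_f1 = search.best_score_
```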
capstone-eda/
├── README.md
├── capstone_eda_baseline.ipynb # Main analysis notebook
└── data/
├── Stack-Overflow-2025.zip # SO 2025 survey data
└── metr_data_complete.csv # METR RCT data
- METR (2025). "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"
- Stack Overflow (2025). "2025 Developer Survey"
- GitClear (2025). "AI Copilot Code Quality Research 2025"
- Pragmatic Engineer (2025). "Cursor makes developers less effective?"