How I would run, read, and learn from a quarter of experiments at a consumer SaaS company.
Live Dashboard: View on GitHub Pages →
Most A/B testing portfolios show one thing: "I can analyze a single experiment correctly."
That's table stakes. The harder — and more valuable — question is: how does an organization run dozens of experiments a year without stepping on itself, learn from what doesn't ship, and avoid the traps that turn surface wins into long-term losses?
This project answers that question. It shows how I would lead the experimentation function at a consumer SaaS company — not just as an analyst running tests, but as a partner helping the organization decide which tests matter, how to read them honestly, and what to do with results that don't fit a clean narrative.
This is not an A/B test analysis project. It's a design for how an analytics function should operate an experimentation program at scale — covering governance, analytical rigor, stakeholder communication, and decision-quality frameworks.
- A framework for running experimentation as an ongoing program, not a series of one-off tests
- A portfolio of 6 experiments chosen to illustrate the recurring patterns a real growth team encounters (Simpson's paradox, novelty decay, underpowered tests, primary vs. north star metric conflicts)
- An operating model with pre-test gates, during-test monitoring, post-test decision rules, and anti-patterns — the kind of playbook a mature experimentation function runs on
- A reusable toolset (Experiment Designer, Governance Framework, per-experiment analysis template) that's scalable across many tests, not hand-crafted for each
- Not a single A/B test analysis. That's table stakes; this project assumes the reader already expects that skill and shows the layer above it.
- Not a production experimentation platform. Platforms handle assignment services, feature flags, streaming pipelines, and access control — that's Engineering work. This project shows the analytical and governance layer that operates on top of a platform.
- Not a demonstration of handling real-world data-layer problems (instrumentation bugs, SRM root-cause investigation, cross-platform event schema issues, bot filtering). Those consume ~80% of the real experimentation workflow, but they're best demonstrated through production work history, not portfolio simulations. My years at Walmart Connect and Google Play involved exactly this kind of data-reliability work — here I'm deliberately focusing on the judgment layer, which is the hardest to demonstrate through code and the most differentiating at senior levels.
When a PM says "We got +12%, can we ship?", the most valuable person in the room is the one who asks:
- What does the D30 retention look like?
- Did we check SRM?
- Is this a new-user effect or across all segments?
- Was this running concurrent with another test?
That capability — being the calm, rigorous voice in a fast-moving growth team — is what this project is built to show.
LinguaLeap is a simulated freemium language learning app (~2M MAU, Duolingo-style business model) used as the backdrop for this project. The portfolio reflects what a realistic quarter at a growth-stage consumer SaaS company might produce: 6 experiments across acquisition, activation, engagement, and monetization.
The analysis here is the quarterly experimentation review I would deliver to the VP of Growth — covering not just results, but what the program itself is teaching us.
| # | Experiment | Area | Raw Read | Decision | Why |
|---|---|---|---|---|---|
| E1 | Social login at signup | Acquisition | +9.2% signup completion, p<0.001 | ✅ Ship | Clean result, consistent across segments |
| E2 | Extended trial (7 → 14 days) | Monetization | Flat overall (p=0.09) | ✅ Ship to new users only | Hidden heterogeneity: +22.9% new / -8% tenured |
| E3 | Push notification time shift | Engagement | +15% week 1, +2% week 4 | ❌ Kill | Novelty effect; long-term incremental near zero |
| E4 | Shortened onboarding (5 → 3 steps) | Activation | +12% D1 activation, p<0.001 | ❌ Kill | D7 retention +1.3%, D30 retention -9.5% |
| E5 | AI conversation practice | Engagement | +2.6% engagement, p=0.13 | 🔁 Re-run | Underpowered test, not a null result |
| E6 | Leaderboard gamification | Engagement | +22% DAU, p<0.001 | ✅ Ship | Clean result validated via holdout |
Ship rate: 33% (2 unconditional ships + 1 segment-level ship, out of 6 experiments).
A team shipping 80%+ of experiments isn't experimenting — they're rubber-stamping. A team shipping under 10% is over-investing in low-impact tests. 33% is the kind of rate a mature program sustains over time.
Every experiment in this portfolio is analyzed with:
- Pre-test: Sample size & MDE calculation, SRM check, concurrent test conflict review
- In-test: Guardrail metric monitoring (no peeking on primary)
- Post-test: Primary metric significance testing (frequentist + Bayesian), variance reduction via CUPED where applicable, segment-level analysis, guardrail metric review
Standard stuff. Necessary but not sufficient.
Three of the six experiments changed decision after segment analysis:
- E2 looked flat overall — but new users loved it (+22.9%) and tenured users hated it (-8%). Shipping to everyone would have been a wash; shipping to new users only is a clear win.
- E4 had the strongest primary metric signal in the portfolio (+12% D1 activation) but killed long-term retention. Primary metric wins can mask North Star losses.
- E3's weekly decomposition revealed the time-shift effect was concentrated in weeks 1–2 and reverted by week 4 — classic novelty.
The job isn't to measure the average effect. It's to find where the effect lives.
The real output of this quarter isn't the ship decisions. It's the pattern across tests:
- Two out of six (E3, E4) would have degraded long-term health if shipped on primary metric alone → policy: activation/engagement tests must validate against D30 retention before ship
- One of six (E2) required segment analysis to find the right answer → segment analysis shouldn't be optional; it should be standard
- One of six (E5) was under-designed → MDE discipline at test design stage needs tightening
These aren't post-hoc observations. They're the inputs to next quarter's experimentation policy. That's what makes it a program, not a series of tests.
Every experiment has a one-page brief answering three questions a VP actually cares about:
- What did we learn? (not "what did the p-value say")
- What should we do? (with confidence level)
- What would change our minds? (the falsifiable conditions for reversing the decision)
1. Program Dashboard — Quarter-level view: 6 experiments, ship/kill/re-run breakdown, cumulative impact, cross-experiment learnings.
2. Experiment Gallery — Per-experiment deep dive: hypothesis, design, raw results, deeper analysis, decision, and learnings — structured identically for each so patterns across tests become visible.
3. Cookie Cats Deep Dive — A standalone analysis of the classic Cookie Cats mobile game gate experiment using the real public dataset (~90K players, Kaggle CC0). Shows hands-on analytical depth on real data: retention curves, statistical testing, bootstrap confidence intervals, Bayesian posterior analysis, and segment-level uplift.
4. Experiment Designer — Interactive tool for sizing a new test: sample size, MDE, duration estimator, and — critically — a "business case" layer that translates statistical parameters into expected business value. Most calculators tell you how many users you need. This one also tells you whether the test is worth running.
5. Trust & Governance Framework — The operating model behind the program: pre-test gates, during-test monitoring, post-test decision rules, and a list of anti-patterns we don't allow.
6. Executive Briefing — The one-page version for the VP of Growth: what we shipped, what we learned, what we're investing in next quarter, and what the program needs from leadership.
The technical stack is intentionally appropriate, not intentionally complex. Every method used is justified by the analytical question, not chosen to show range.
Statistical methods:
- Frequentist hypothesis testing (two-proportion z-test, Welch's t-test)
- Bayesian A/B analysis (Beta-Binomial conjugate) — paired with frequentist results, not replacing them
- CUPED variance reduction (applied selectively, only where pre-period data is meaningful)
- Bootstrap confidence intervals (for non-parametric robustness checks on continuous metrics)
- Segment-level analysis with attention to multiple comparisons
- Post-hoc power analysis (for diagnosing underpowered tests)
Diagnostics:
- SRM (Sample Ratio Mismatch) check on every test
- Guardrail metric monitoring
- Concurrent test interaction audit
- Holdout validation for major features
- No causal forests / uplift models — the HTE analysis here is segment-based and interpretable, which is what a growth team can actually action
- No sequential testing / mSPRT — interesting academically but overkill for 2–4 week tests
- No ML-based heterogeneity estimation — the business doesn't need a black box; it needs a defensible recommendation
- No full experimentation platform engineering — randomization services, feature flags, streaming analytics are platform/infra work, not analytics work
The signal of analytical maturity isn't using every tool you know — it's knowing which tool the situation calls for.
experimentation-playbook/
├── model/
│ ├── generate_experiment_data.py # Simulates 6 LinguaLeap experiments (Cookie Cats is real data)
│ ├── analyze_experiments.py # Runs the full analysis suite
│ └── cookie_cats_analysis.py # Dedicated deep-dive on the real-data case
├── dashboard/
│ └── experimentation_dashboard.jsx # Interactive React dashboard (6 tabs)
├── data/
│ ├── experiments/ # Generated CSVs (one per experiment)
│ └── results/ # Analysis output JSONs
├── notebooks/
│ └── cookie_cats_deep_dive.ipynb # Narrative analysis walkthrough (real data, business-focused)
├── charts/ # Static PNG charts (one per experiment + Cookie Cats)
├── docs/ # GitHub Pages: index.html + experimentation_dashboard.jsx (CDN React/Recharts + Babel)
├── methodology.md # Technical methodology deep-dive
├── requirements.txt
└── README.md
Key files:
notebooks/cookie_cats_deep_dive.ipynb— Step-by-step business analysis of the real Cookie Cats dataset (90K players)model/generate_experiment_data.py— Data generation with ground-truth effects and embedded teaching momentsmodel/analyze_experiments.py— Analysis pipeline with methods tailored per experimentmethodology.md— Full statistical methodology
# Install dependencies
pip install -r requirements.txt
# Generate data (one-time)
python model/generate_experiment_data.py
# Run full analysis
python model/analyze_experiments.py
python model/cookie_cats_analysis.py
# Generate charts
python model/generate_charts.py
python model/cookie_cats_charts.pyThe React dashboard is a single .jsx file that can be rendered in any React environment. Dependencies: react, recharts.
Note on dashboard data: The dashboard figures are static snapshots derived from
data/results/*.json. This is a deliberate design choice for GitHub Pages (no backend). If you re-run the analysis scripts with a different seed or dataset, update the embedded constants inexperimentation_dashboard.jsxto match.
- The 6 experiments are simulated, with ground-truth effects set during data generation. This is for methodology demonstration — in production, you don't know the ground truth, which makes experimentation much harder and judgment much more valuable.
- Cookie Cats uses the real public dataset (90,189 players, Kaggle CC0) — this is genuine data with real distributional quirks, not a simulation. The other 6 LinguaLeap experiments are simulated with deliberate ground-truth effects for methodology demonstration.
- Segment definitions are predefined in the simulation. In reality, finding the right segments is half the battle, and doing it post-hoc raises multiple-testing concerns. The segments here are chosen for clarity of teaching, not optimized via search.
- No causal inference beyond experimentation — quasi-experimental methods (DiD, synthetic control, RDD) for cases where experimentation isn't feasible are important in practice but out of scope here.
- This is one quarter's worth of experiments. Real program maturity takes years of accumulated learnings. What's shown here is what quarter 4 or 5 of an experimentation program might look like, not year 1.
If you're evaluating analytical skill: see the Cookie Cats Deep Dive and the E2 / E4 / E5 experiment analyses.
If you're evaluating framework thinking: see the Program Dashboard and Trust & Governance Framework.
If you're evaluating stakeholder communication: see the Executive Briefing and any per-experiment one-pager in the Gallery.
Freena Wang — Senior Marketing & Product Analytics Professional
Built as a portfolio project demonstrating end-to-end experimentation program capabilities: from designing individual tests through program-level governance and executive communication.
Companion project to ShopNova Marketing Mix Model, which covers the aggregate-measurement side of growth analytics.
Together, the two projects span the full growth measurement stack: MMM answers "where should marketing invest?", experimentation answers "what should product change?"