Description
Background
We have two approaches to predicting ENVO terms from Google Earth embeddings:
1. k-Nearest Neighbors (Unsupervised, Lazy Learning)
Location: notebooks/similarity_analysis.ipynb Cells 18-19
Method:
- Find k=5 samples with most similar GE embeddings (cosine similarity)
- Aggregate their ENVO terms via majority vote
- Predict the most common triad
from collections import Counter
import numpy as np

def predict_triad_from_ge(target_sample, reference_samples, k=5):
    # Cosine similarity to each reference sample's GE embedding
    # (assumes each sample dict stores its embedding under 'embedding')
    similarities = [(np.dot(target_sample, s['embedding']) /
                     (np.linalg.norm(target_sample) * np.linalg.norm(s['embedding'])), s)
                    for s in reference_samples]
    # Find k most GE-similar samples
    top_k = [s for _, s in sorted(similarities, key=lambda x: x[0], reverse=True)[:k]]
    # Majority vote for each scale
    predicted_broad = Counter(s['env_broad_scale'] for s in top_k).most_common(1)[0][0]
    # ... same for local, medium

Pros:
- Simple, interpretable
- No training needed
- Works for any sample (even with 0 training data)
Cons:
- Slow at inference (must compare to all samples)
- No feature weighting (treats all GE dims equally)
- Sensitive to k choice
2. Random Forest (Supervised, Eager Learning)
Location: notebooks/random_forest_envo_prediction.ipynb (Issue #49)
Method:
- Train on labeled samples (X=GE embeddings, y=ENVO terms)
- Learn decision boundaries in 64-dim space
- Fast inference on new samples
Pros:
- Fast at inference
- Learns feature importance (which GE dims matter)
- Handles non-linear relationships
Cons:
- Requires training data
- Black box (less interpretable than k-NN)
- Can overfit
Research Question
Does supervised learning (RF) beat unsupervised baseline (k-NN)?
If RF only marginally better → k-NN sufficient (simpler, no training)
If RF significantly better → Justifies ML approach
Evaluation Plan
Metrics to Compare
- Accuracy (exact match)
  k_nn_accuracy = (k_nn_predictions == true_labels).mean()
  rf_accuracy = (rf_predictions == true_labels).mean()
- Hierarchical accuracy (from Issue #51: ontology-aware evaluation metrics for ENVO predictions, with partial credit for parent/child terms)
  - Use ontology-aware scoring
  - Does k-NN get "close misses" more often?
- Prediction confidence (see the sketch after this list)
  - k-NN: vote proportion (e.g., 4/5 neighbors agree = 0.8)
  - RF: max probability from .predict_proba()
- Per-class performance
  - Which ENVO terms does each method handle better?
  - Is k-NN better for common terms and RF better for rare ones?
- Runtime
  - k-NN inference: O(n) per query (must compare to all training samples)
  - RF inference: O(log n × trees) per query (much faster)
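For the prediction-confidence comparison above, both scores can be put on the same 0–1 scale. A minimal sketch, assuming the k-NN neighbor labels are available as an (n_test, k) array and the RF is a fitted scikit-learn RandomForestClassifier; the helper names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_confidence(neighbor_labels):
    # Vote proportion of the winning label among the k neighbors, per test sample
    return np.array([Counter(row).most_common(1)[0][1] / len(row)
                     for row in neighbor_labels])

def rf_confidence(rf_model, X_test):
    # Highest class probability per test sample from predict_proba()
    return rf_model.predict_proba(X_test).max(axis=1)
```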
Experimental Setup
Test on same data:
- Use NMDC complete dataset (8,121 samples)
- Same train/test split for both methods
- Same random seed for reproducibility
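A minimal sketch of the shared split, assuming the GE embedding matrix and the three ENVO label arrays are already loaded as X, y_broad, y_local, y_medium (the 80/20 ratio and seed 42 are illustrative choices):

```python
from sklearn.model_selection import train_test_split

# One split and one seed, reused by both k-NN and Random Forest
(X_train, X_test,
 y_train_broad, y_test_broad,
 y_train_local, y_test_local,
 y_train_medium, y_test_medium) = train_test_split(
    X, y_broad, y_local, y_medium, test_size=0.2, random_state=42)
```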
k-NN configuration:
from statistics import mode
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# For each test sample
for test_sample in X_test:
    # Find k=5 most similar training samples by GE embedding
    similarities = cosine_similarity(test_sample.reshape(1, -1), X_train)[0]
    top_k_indices = np.argsort(similarities)[-5:]
    # Majority vote
    predicted_broad = mode(y_train_broad[top_k_indices])
    # ... same for local, medium

RF configuration (from #49):
from sklearn.ensemble import RandomForestClassifier

rf_broad = RandomForestClassifier(n_estimators=100, max_depth=10)
rf_broad.fit(X_train, y_train_broad)
predicted_broad = rf_broad.predict(X_test)

Comparison Table
| Metric | k-NN (k=5) | Random Forest | Winner |
|---|---|---|---|
| broad_scale accuracy | ? | ~55% | ? |
| local_scale accuracy | ? | ~53% | ? |
| medium accuracy | ? | ~50% | ? |
| Hierarchical accuracy | ? | ? | ? |
| Avg confidence | ? | ~20% | ? |
| Inference time (1000 samples) | ? | ? | ? |
| Training time | 0s | ~60s | k-NN |
Implementation Tasks
- Extract k-NN prediction code from similarity_analysis.ipynb
- Create reusable function: predict_knn_triad(X_test, X_train, y_train, k=5) (a vectorized sketch follows this task list)
- Add k-NN comparison section to RF notebook (after RF training)
- Run both methods on same test set
- Compute all comparison metrics
- Visualizations:
  - Accuracy comparison bar chart
  - Confidence distribution comparison
  - Confusion matrices side-by-side
- Error analysis: which samples does each get right/wrong?
- Document findings in notebook markdown
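One possible shape for the reusable predict_knn_triad function from the task list above: a vectorized sketch that computes all pairwise cosine similarities once. It assumes y_train is a dict of label arrays keyed by scale (e.g., 'broad', 'local', 'medium'); that structure is an assumption, not something fixed by this issue.

```python
import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

def predict_knn_triad(X_test, X_train, y_train, k=5):
    # y_train is assumed to be {'broad': ..., 'local': ..., 'medium': ...} label arrays
    sims = cosine_similarity(X_test, X_train)        # (n_test, n_train)
    top_k = np.argsort(sims, axis=1)[:, -k:]         # indices of the k most similar samples
    predictions = {}
    for scale, labels in y_train.items():
        labels = np.asarray(labels)
        # Majority vote over the k neighbors' labels, per test sample
        predictions[scale] = np.array([
            Counter(labels[row]).most_common(1)[0][0] for row in top_k
        ])
    return predictions
```

Exact-match accuracy per scale would then follow as, e.g., (predictions['broad'] == y_test_broad).mean().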
Expected Results
Hypothesis: RF should outperform k-NN by 5-10% accuracy
Scenario 1 (RF wins clearly):
- RF accuracy: 53%
- k-NN accuracy: 43%
- Conclusion: ML justified, worth training cost
Scenario 2 (RF wins marginally):
- RF accuracy: 53%
- k-NN accuracy: 50%
- Conclusion: Debatable - k-NN simpler, RF faster at inference
Scenario 3 (Tie):
- RF accuracy: 53%
- k-NN accuracy: 53%
- Conclusion: Use k-NN (no training needed)
Scenario 4 (k-NN wins - surprising!):
- k-NN accuracy: 55%
- RF accuracy: 53%
- Conclusion: GE embeddings designed for similarity, not classification
Hyperparameter Tuning
If k-NN performs well, try different k:
for k in [3, 5, 7, 10, 15, 20]:
    k_nn_predictions = predict_knn_triad(X_test, X_train, y_train, k=k)
    accuracy = (k_nn_predictions == y_test).mean()
    print(f"k={k}: accuracy={accuracy:.3f}")

Optimal k balances:
- Small k (e.g., 3): more sensitive to noise among the neighbors (higher variance)
- Large k (e.g., 20): smoother votes (lower variance), but rare ENVO terms get outvoted (higher bias)
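As a cross-check on the hand-rolled loop, scikit-learn's KNeighborsClassifier also supports cosine distance and makes the k sweep trivial (a sketch for the broad scale only; parameter values are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

for k in [3, 5, 7, 10, 15, 20]:
    knn = KNeighborsClassifier(n_neighbors=k, metric='cosine')
    knn.fit(X_train, y_train_broad)
    print(f"k={k}: broad_scale accuracy={knn.score(X_test, y_test_broad):.3f}")
```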
Related
- Apply Random Forest ENVO predictor to NMDC complete dataset (8,121 samples) #49 - RF training (provides baseline)
- Implement ontology-aware evaluation metrics for ENVO predictions (partial credit for parent/child terms) #51 - Ontology-aware eval (apply to both methods)
- similarity_analysis.ipynb - k-NN implementation
Priority
Medium-High - Should be part of #49 work or immediate follow-up.
Provides scientific validation: "Is ML worth it?"
Notes
Hybrid approach (future work):
- Use k-NN for high-confidence predictions (5/5 neighbors agree)
- Use RF for low-confidence k-NN predictions
- Best of both worlds?
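A rough sketch of that routing logic, reusing the confidence helpers sketched earlier; the threshold of 1.0 (all k neighbors agree) is an assumption:

```python
import numpy as np

def hybrid_predict(knn_preds, knn_conf, rf_preds, agreement_threshold=1.0):
    # Trust k-NN where its neighbors fully agree; fall back to RF elsewhere
    use_knn = knn_conf >= agreement_threshold
    return np.where(use_knn, knn_preds, rf_preds)
```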