
Compare RF predictions to k-NN baseline from similarity_analysis.ipynb #56

@turbomam

Description

Background

We have two approaches to predicting ENVO terms from Google Earth embeddings:

1. k-Nearest Neighbors (Unsupervised, Lazy Learning)

Location: notebooks/similarity_analysis.ipynb Cells 18-19

Method:

  1. Find the k=5 samples with the most similar GE embeddings (cosine similarity)
  2. Take a majority vote over their ENVO terms for each scale
  3. Combine the per-scale winners into the predicted triad
import numpy as np
from collections import Counter

def predict_triad_from_ge(target_sample, reference_samples, k=5):
    # Cosine similarity between the target's GE embedding and each reference sample's
    # (assumes each sample dict carries its GE vector under 'embedding')
    target = np.asarray(target_sample['embedding'])
    sims = [(np.dot(target, s['embedding']) /
             (np.linalg.norm(target) * np.linalg.norm(s['embedding'])), s) for s in reference_samples]
    # Keep the k most GE-similar samples
    top_k = [s for _, s in sorted(sims, key=lambda x: x[0], reverse=True)[:k]]
    # Majority vote for each scale
    predicted_broad = Counter(s['env_broad_scale'] for s in top_k).most_common(1)[0][0]
    # ... same for local, medium
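
A minimal usage sketch, assuming the full function returns the (broad, local, medium) tuple and that samples is a list of sample dicts as above (variable names are illustrative):

# Leave-one-out spot check on a few samples
for target in samples[:10]:
    reference = [s for s in samples if s is not target]
    broad, local, medium = predict_triad_from_ge(target, reference, k=5)
    print(broad == target['env_broad_scale'], broad)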

Pros:

  • Simple, interpretable
  • No training needed
  • No model fitting required; works for any sample as long as labeled reference samples exist

Cons:

  • Slow at inference (must compare to all samples)
  • No feature weighting (treats all GE dims equally)
  • Sensitive to k choice

2. Random Forest (Supervised, Eager Learning)

Location: notebooks/random_forest_envo_prediction.ipynb (Issue #49)

Method:

  1. Train on labeled samples (X=GE embeddings, y=ENVO terms)
  2. Learn decision boundaries in 64-dim space
  3. Fast inference on new samples

Pros:

  • Fast at inference
  • Learns feature importance (which GE dims matter; see the sketch at the end of this section)
  • Handles non-linear relationships

Cons:

  • Requires training data
  • Black box (less interpretable than k-NN)
  • Can overfit
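
To illustrate the feature-importance point above, a short sketch assuming a fitted rf_broad classifier (as configured later in this issue):

import numpy as np

# Which GE embedding dimensions carry the most signal for broad_scale?
importances = rf_broad.feature_importances_
top_dims = np.argsort(importances)[::-1][:10]
print("Top 10 GE dims for broad_scale:", top_dims)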

Research Question

Does supervised learning (RF) beat unsupervised baseline (k-NN)?

If RF only marginally better → k-NN sufficient (simpler, no training)
If RF significantly better → Justifies ML approach

Evaluation Plan

Metrics to Compare

  1. Accuracy (exact match)

    k_nn_accuracy = (k_nn_predictions == true_labels).mean()
    rf_accuracy = (rf_predictions == true_labels).mean()
  2. Hierarchical accuracy (from issue #51, "Implement ontology-aware evaluation metrics for ENVO predictions (partial credit for parent/child terms)")

    • Use ontology-aware scoring
    • Does k-NN get "close misses" more often?
  3. Prediction confidence (see the sketch after this list)

    • k-NN: Vote proportion (e.g., 4/5 neighbors agree = 0.8)
    • RF: Max probability from .predict_proba()
  4. Per-class performance

    • Which ENVO terms does each method handle better?
    • k-NN better for common terms? RF better for rare?
  5. Runtime

    • k-NN inference: O(n) - must compare to all samples
    • RF inference: O(trees × tree depth) - independent of training-set size, much faster
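
For the confidence comparison, a hedged sketch (k_nn_vote_counts, holding the winning vote count per test sample, is an illustrative name; rf_broad is the fitted classifier from the configuration below):

import numpy as np

# k-NN confidence: fraction of the k=5 neighbors that backed the winning term
knn_confidence = np.array(k_nn_vote_counts) / 5.0

# RF confidence: maximum class probability per test sample
rf_confidence = rf_broad.predict_proba(X_test).max(axis=1)

print(f"Mean k-NN confidence: {knn_confidence.mean():.2f}")
print(f"Mean RF confidence:   {rf_confidence.mean():.2f}")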

Experimental Setup

Test on same data:

  • Use NMDC complete dataset (8,121 samples)
  • Same train/test split for both methods
  • Same random seed for reproducibility (a split sketch follows)
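
A minimal split sketch under those constraints (X, y_broad, y_local, y_medium are illustrative names for the embedding matrix and label arrays):

import numpy as np
from sklearn.model_selection import train_test_split

# Split sample indices once and reuse them for both methods and all three scales
idx_train, idx_test = train_test_split(np.arange(len(X)), test_size=0.2, random_state=42)
X_train, X_test = X[idx_train], X[idx_test]
y_train_broad, y_test_broad = y_broad[idx_train], y_broad[idx_test]
# ... same for local, medium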

k-NN configuration:

import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

# For each test sample
for test_sample in X_test:
    # Find the k=5 most similar training samples by GE embedding
    similarities = cosine_similarity(test_sample.reshape(1, -1), X_train)[0]
    top_k_indices = np.argsort(similarities)[-5:]

    # Majority vote over the neighbors' labels
    predicted_broad = Counter(y_train_broad[top_k_indices]).most_common(1)[0][0]
    # ... same for local, medium

RF configuration (from #49):

from sklearn.ensemble import RandomForestClassifier

# One classifier per scale; repeat for local_scale and medium
rf_broad = RandomForestClassifier(n_estimators=100, max_depth=10)
rf_broad.fit(X_train, y_train_broad)
predicted_broad = rf_broad.predict(X_test)

Comparison Table

| Metric | k-NN (k=5) | Random Forest | Winner |
|---|---|---|---|
| broad_scale accuracy | ? | ~55% | ? |
| local_scale accuracy | ? | ~53% | ? |
| medium accuracy | ? | ~50% | ? |
| Hierarchical accuracy | ? | ? | ? |
| Avg confidence | ? | ~20% | ? |
| Inference time (1000 samples) | ? | ? | ? |
| Training time | 0 s | ~60 s | k-NN |

Implementation Tasks

  • Extract k-NN prediction code from similarity_analysis.ipynb
  • Create reusable function: predict_knn_triad(X_test, X_train, y_train, k=5) (see the sketch after this list)
  • Add k-NN comparison section to RF notebook (after RF training)
  • Run both methods on same test set
  • Compute all comparison metrics
  • Visualizations:
    • Accuracy comparison bar chart
    • Confidence distribution comparison
    • Confusion matrices side-by-side
    • Error analysis: which samples does each get right/wrong?
  • Document findings in notebook markdown
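
A minimal sketch of that reusable function, assuming y_train is a dict of three label arrays keyed by scale (the signature and key names are illustrative, not final):

import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

def predict_knn_triad(X_test, X_train, y_train, k=5):
    # Pairwise cosine similarities: shape (n_test, n_train)
    similarities = cosine_similarity(X_test, X_train)
    # Indices of the k most similar training samples for each test sample
    top_k = np.argsort(similarities, axis=1)[:, -k:]
    predictions = {}
    for scale in ('broad', 'local', 'medium'):
        labels = np.asarray(y_train[scale])
        predictions[scale] = np.array([
            Counter(labels[idx]).most_common(1)[0][0] for idx in top_k
        ])
    return predictions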

Expected Results

Hypothesis: RF should outperform k-NN by 5-10 percentage points of accuracy

Scenario 1 (RF wins clearly):

  • RF accuracy: 53%
  • k-NN accuracy: 43%
  • Conclusion: ML justified, worth training cost

Scenario 2 (RF wins marginally):

  • RF accuracy: 53%
  • k-NN accuracy: 50%
  • Conclusion: Debatable - k-NN simpler, RF faster at inference

Scenario 3 (Tie):

  • RF accuracy: 53%
  • k-NN accuracy: 53%
  • Conclusion: Use k-NN (no training needed)

Scenario 4 (k-NN wins - surprising!):

  • k-NN accuracy: 55%
  • RF accuracy: 53%
  • Conclusion: GE embeddings designed for similarity, not classification

Hyperparameter Tuning

If k-NN performs well, try different k:

for k in [3, 5, 7, 10, 15, 20]:
    k_nn_predictions = predict_knn_triad(X_test, X_train, y_train, k=k)
    accuracy = (k_nn_predictions == y_test).mean()
    print(f"k={k}: accuracy={accuracy:.3f}")

Optimal k balances bias and variance:

  • Small k (e.g., 3): more sensitive to noisy neighbors (higher variance)
  • Large k (e.g., 20): smoother, lower-variance votes, but higher bias (rare terms get outvoted)

Related

Priority

Medium-High - Should be part of #49 work or immediate follow-up.

Provides scientific validation: "Is ML worth it?"

Notes

Hybrid approach (future work; see the sketch below):

  • Use k-NN for high-confidence predictions (5/5 neighbors agree)
  • Use RF for low-confidence k-NN predictions
  • Best of both worlds?
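
A rough sketch of that routing logic, assuming the k-NN step also reports its winning vote count (all variable names are illustrative):

import numpy as np

# Trust unanimous k-NN votes; fall back to RF everywhere else
knn_unanimous = np.array(k_nn_vote_counts) == 5
hybrid_broad = np.where(knn_unanimous, knn_broad_predictions, rf_broad_predictions)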
