Description
Background
We have two approaches to predicting ENVO terms from Google Earth embeddings:
1. k-Nearest Neighbors (Unsupervised, Lazy Learning)
Location: notebooks/similarity_analysis.ipynb Cells 18-19
Method:
- Find k=5 samples with most similar GE embeddings (cosine similarity)
- Aggregate their ENVO terms via majority vote
- Predict the most common triad
from collections import Counter
import numpy as np

def predict_triad_from_ge(target_sample, reference_samples, k=5):
    # Cosine similarity to each reference sample's GE embedding
    # (assumes each sample dict stores its embedding under 'embedding')
    similarities = [(np.dot(target_sample, s['embedding']) /
                     (np.linalg.norm(target_sample) * np.linalg.norm(s['embedding'])), s)
                    for s in reference_samples]
    # Find k most GE-similar samples
    top_k = [s for _, s in sorted(similarities, key=lambda x: x[0], reverse=True)[:k]]
    # Majority vote for each scale
    predicted_broad = Counter(s['env_broad_scale'] for s in top_k).most_common(1)[0][0]
    # ... same for local, medium

Pros:
- Simple, interpretable
- No training needed
- Works for any sample (even with 0 training data)
Cons:
- Slow at inference (must compare to all samples)
- No feature weighting (treats all GE dims equally)
- Sensitive to k choice
2. Random Forest (Supervised, Eager Learning)
Location: notebooks/random_forest_envo_prediction.ipynb (Issue #49)
Method:
- Train on labeled samples (X=GE embeddings, y=ENVO terms)
- Learn decision boundaries in 64-dim space
- Fast inference on new samples
Pros:
- Fast at inference
- Learns feature importance (which GE dims matter)
- Handles non-linear relationships
Cons:
- Requires training data
- Black box (less interpretable than k-NN)
- Can overfit
Research Question
Does supervised learning (RF) beat unsupervised baseline (k-NN)?
If RF only marginally better → k-NN sufficient (simpler, no training)
If RF significantly better → Justifies ML approach
Evaluation Plan
Metrics to Compare
- Accuracy (exact match)
  k_nn_accuracy = (k_nn_predictions == true_labels).mean()
  rf_accuracy = (rf_predictions == true_labels).mean()
- Hierarchical accuracy (from Issue #51: ontology-aware evaluation metrics for ENVO predictions, with partial credit for parent/child terms)
  - Use ontology-aware scoring
  - Does k-NN get "close misses" more often?
- Prediction confidence (see the sketch after this list)
  - k-NN: vote proportion (e.g., 4/5 neighbors agree = 0.8)
  - RF: max probability from .predict_proba()
- Per-class performance
  - Which ENVO terms does each method handle better?
  - Is k-NN better for common terms and RF better for rare ones?
- Runtime
  - k-NN inference: O(n) per query (must compare to all training samples)
  - RF inference: O(log n × trees) per query (much faster)
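For the prediction-confidence comparison above, both scores can be put on the same 0–1 scale. A minimal sketch, assuming the k-NN neighbor labels are available as an (n_test, k) array and the RF is a fitted scikit-learn RandomForestClassifier; the helper names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_confidence(neighbor_labels):
    # Vote proportion of the winning label among the k neighbors, per test sample
    return np.array([Counter(row).most_common(1)[0][1] / len(row)
                     for row in neighbor_labels])

def rf_confidence(rf_model, X_test):
    # Highest class probability per test sample from predict_proba()
    return rf_model.predict_proba(X_test).max(axis=1)
```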
Experimental Setup
Test on same data:
- Use NMDC complete dataset (8,121 samples)
- Same train/test split for both methods
- Same random seed for reproducibility
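A minimal sketch of the shared split, assuming the GE embedding matrix and the three ENVO label arrays are already loaded as X, y_broad, y_local, y_medium (the 80/20 ratio and seed 42 are illustrative choices):

```python
from sklearn.model_selection import train_test_split

# One split and one seed, reused by both k-NN and Random Forest
(X_train, X_test,
 y_train_broad, y_test_broad,
 y_train_local, y_test_local,
 y_train_medium, y_test_medium) = train_test_split(
    X, y_broad, y_local, y_medium, test_size=0.2, random_state=42)
```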
k-NN configuration:
from statistics import mode
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# For each test sample
for test_sample in X_test:
    # Find k=5 most similar training samples by GE embedding
    similarities = cosine_similarity(test_sample.reshape(1, -1), X_train)[0]
    top_k_indices = np.argsort(similarities)[-5:]
    # Majority vote
    predicted_broad = mode(y_train_broad[top_k_indices])
    # ... same for local, medium

RF configuration (from #49):
from sklearn.ensemble import RandomForestClassifier

rf_broad = RandomForestClassifier(n_estimators=100, max_depth=10)
rf_broad.fit(X_train, y_train_broad)
predicted_broad = rf_broad.predict(X_test)

Comparison Table
| Metric | k-NN (k=5) | Random Forest | Winner |
|---|---|---|---|
| broad_scale accuracy | ? | ~55% | ? |
| local_scale accuracy | ? | ~53% | ? |
| medium accuracy | ? | ~50% | ? |
| Hierarchical accuracy | ? | ? | ? |
| Avg confidence | ? | ~20% | ? |
| Inference time (1000 samples) | ? | ? | ? |
| Training time | 0s | ~60s | k-NN |
Implementation Tasks
- Extract k-NN prediction code from similarity_analysis.ipynb
- Create reusable function: predict_knn_triad(X_test, X_train, y_train, k=5) (a vectorized sketch follows this task list)
- Add k-NN comparison section to RF notebook (after RF training)
- Run both methods on same test set
- Compute all comparison metrics
- Visualizations:
  - Accuracy comparison bar chart
  - Confidence distribution comparison
  - Confusion matrices side-by-side
- Error analysis: which samples does each get right/wrong?
- Document findings in notebook markdown
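One possible shape for the reusable predict_knn_triad function from the task list above: a vectorized sketch that computes all pairwise cosine similarities once. It assumes y_train is a dict of label arrays keyed by scale (e.g., 'broad', 'local', 'medium'); that structure is an assumption, not something fixed by this issue.

```python
import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

def predict_knn_triad(X_test, X_train, y_train, k=5):
    # y_train is assumed to be {'broad': ..., 'local': ..., 'medium': ...} label arrays
    sims = cosine_similarity(X_test, X_train)        # (n_test, n_train)
    top_k = np.argsort(sims, axis=1)[:, -k:]         # indices of the k most similar samples
    predictions = {}
    for scale, labels in y_train.items():
        labels = np.asarray(labels)
        # Majority vote over the k neighbors' labels, per test sample
        predictions[scale] = np.array([
            Counter(labels[row]).most_common(1)[0][0] for row in top_k
        ])
    return predictions
```

Exact-match accuracy per scale would then follow as, e.g., (predictions['broad'] == y_test_broad).mean().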
Expected Results
Hypothesis: RF should outperform k-NN by 5-10% accuracy
Scenario 1 (RF wins clearly):
- RF accuracy: 53%
- k-NN accuracy: 43%
- Conclusion: ML justified, worth training cost
Scenario 2 (RF wins marginally):
- RF accuracy: 53%
- k-NN accuracy: 50%
- Conclusion: Debatable - k-NN simpler, RF faster at inference
Scenario 3 (Tie):
- RF accuracy: 53%
- k-NN accuracy: 53%
- Conclusion: Use k-NN (no training needed)
Scenario 4 (k-NN wins - surprising!):
- k-NN accuracy: 55%
- RF accuracy: 53%
- Conclusion: GE embeddings designed for similarity, not classification
Hyperparameter Tuning
If k-NN performs well, try different k:
for k in [3, 5, 7, 10, 15, 20]:
    k_nn_predictions = predict_knn_triad(X_test, X_train, y_train, k=k)
    accuracy = (k_nn_predictions == y_test).mean()
    print(f"k={k}: accuracy={accuracy:.3f}")

Optimal k balances:
- Small k (e.g., 3): more sensitive to noise among the neighbors (higher variance)
- Large k (e.g., 20): smoother votes (lower variance), but rare ENVO terms get outvoted (higher bias)
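As a cross-check on the hand-rolled loop, scikit-learn's KNeighborsClassifier also supports cosine distance and makes the k sweep trivial (a sketch for the broad scale only; parameter values are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

for k in [3, 5, 7, 10, 15, 20]:
    knn = KNeighborsClassifier(n_neighbors=k, metric='cosine')
    knn.fit(X_train, y_train_broad)
    print(f"k={k}: broad_scale accuracy={knn.score(X_test, y_test_broad):.3f}")
```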
Related
- Apply Random Forest ENVO predictor to NMDC complete dataset (8,121 samples) #49 - RF training (provides baseline)
- Implement ontology-aware evaluation metrics for ENVO predictions (partial credit for parent/child terms) #51 - Ontology-aware eval (apply to both methods)
- similarity_analysis.ipynb - k-NN implementation
Priority
Medium-High - Should be part of #49 work or immediate follow-up.
Provides scientific validation: "Is ML worth it?"
Notes
Hybrid approach (future work):
- Use k-NN for high-confidence predictions (5/5 neighbors agree)
- Use RF for low-confidence k-NN predictions
- Best of both worlds?
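A rough sketch of that routing logic, reusing the confidence helpers sketched earlier; the threshold of 1.0 (all k neighbors agree) is an assumption:

```python
import numpy as np

def hybrid_predict(knn_preds, knn_conf, rf_preds, agreement_threshold=1.0):
    # Trust k-NN where its neighbors fully agree; fall back to RF elsewhere
    use_knn = knn_conf >= agreement_threshold
    return np.where(use_knn, knn_preds, rf_preds)
```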