🐛 Bug Description
When running inference with a trained model, the prediction score for a given date varies significantly depending on the test segment's end date in the dataset configuration. This makes backtesting results unreliable and inconsistent with live trading scenarios.
📋 Steps to Reproduce
Setup
- Model: GATs (Graph Attention Time Series) with LSTM base
- Dataset: TSDatasetH with Alpha158 handler
- Framework: Qlib (latest version)
- Config: `workflow_config_gats_Alpha158.yaml`
Test Case
Scenario 1: Test segment = ("2024-01-01", "2025-12-05")

```python
dataset_config['kwargs']['segments']['test'] = ("2024-01-01", "2025-12-05")
dataset = init_instance_by_config(dataset_config)
predictions = model.predict(dataset)

# Check prediction for stock SH688270 on 2025-12-05
score_1 = predictions.loc[('2025-12-05', 'SH688270')]
# Result: -0.24687856
```

Scenario 2: Test segment = ("2024-01-01", "2025-12-31")

```python
dataset_config['kwargs']['segments']['test'] = ("2024-01-01", "2025-12-31")
dataset = init_instance_by_config(dataset_config)
predictions = model.predict(dataset)

# Check prediction for the SAME stock on the SAME date
score_2 = predictions.loc[('2025-12-05', 'SH688270')]
# Result: 0.59764999
```

Difference: |score_1 - score_2| = 0.8445 🔴
This is a huge difference for the exact same stock on the exact same date!
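For reference, the divergence between the two runs can be quantified by aligning the prediction Series on their shared (datetime, instrument) index. A minimal pandas sketch, using synthetic stand-in values in place of the real `model.predict()` output:

```python
import pandas as pd

# Synthetic stand-ins for the two prediction Series returned by
# model.predict(dataset); real output is indexed by (datetime, instrument).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2025-12-04", "2025-12-05"]), ["SH688270", "SH600000"]],
    names=["datetime", "instrument"],
)
pred_short = pd.Series([0.1, 0.3, -0.24687856, 0.2], index=idx)  # test ends 2025-12-05
pred_long = pd.Series([0.1, 0.3, 0.59764999, 0.2], index=idx)    # test ends 2025-12-31

# Compare the two runs on their common (date, stock) pairs.
common = pred_short.index.intersection(pred_long.index)
diff = (pred_short.loc[common] - pred_long.loc[common]).abs()
print(diff.max())     # ≈ 0.8445 for the pair reported above
print(diff.idxmax())  # the (date, stock) pair that diverges most
```

Running such a comparison over the whole overlapping range would show whether the divergence is isolated to dates near the segment boundary or spread across the test period.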
🎯 Expected Behavior
When predicting for date T, the prediction should be invariant to the test segment's end date, as long as:
- The end date is >= T (sufficient data is available)
- The feature values for date T are identical
- The model is using the same weights
Rationale: In live trading on 2025-12-05, we only have data up to that date. The backtest should simulate this exact scenario to provide reliable performance metrics.
❌ Actual Behavior
The prediction for date T changes when the test segment's end date changes, even though:
- ✅ Feature values are identical (verified)
- ✅ Label values are identical (verified)
- ✅ Sample availability is identical (verified)
- ✅ Normalization parameters are identical (verified with fixed `fit_start_time` and `fit_end_time`)
This suggests the test segment range affects some internal state or processing in TSDatasetH or the data loading pipeline.
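To illustrate the kind of mechanism that would produce this symptom (purely hypothetical, not confirmed to be what `TSDatasetH` actually does): any statistic computed over the whole test segment, rather than causally up to T, makes the value at T depend on data after T. A causal variant is invariant to the end date:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=500)  # fixed raw values, one per day

def zscore_over_segment(x):
    """Normalize using the full segment's mean/std (non-causal)."""
    return (x - x.mean()) / x.std()

t = 365  # index of the target date T within the segment
short = zscore_over_segment(scores[: t + 1])    # segment ends at T
long_ = zscore_over_segment(scores[: t + 101])  # segment ends 100 days later

# The non-causal value at T depends on data *after* T:
assert short[t] != long_[t]

def zscore_causal(x, i):
    """Normalize using only data up to and including index i."""
    return (x[i] - x[: i + 1].mean()) / x[: i + 1].std()

# The causal value at T is invariant to the segment's end date:
assert zscore_causal(scores[: t + 1], t) == zscore_causal(scores[: t + 101], t)
```

If the infer pipeline is fully causal, the two scenarios above should be indistinguishable at date T, which is exactly the expected behavior stated in this report.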
🔍 Investigation Results
What We Verified
- Handler-level features (via `handler.fetch()`): ✅ Identical
- Handler normalization (`RobustZScoreNorm` with fixed `fit_start_time`/`fit_end_time`): ✅ Identical
- Sample availability on target date: ✅ Identical (same stocks)
- Learn processors (like `CSRankNorm`): ✅ Disabled for inference
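The feature-identity check above can be scripted generically. The sketch below uses synthetic frames in place of the real handler output (in Qlib they would come from something like `handler.fetch()` for each dataset configuration):

```python
import numpy as np
import pandas as pd

# Stand-ins for handler-level feature frames from the two configurations.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2025-12-01", periods=5), ["SH688270"]],
    names=["datetime", "instrument"],
)
feat_short = pd.DataFrame(
    np.arange(10.0).reshape(5, 2), index=idx, columns=["f1", "f2"]
)
feat_long = feat_short.copy()  # identical features, as verified in this report

# Restrict both frames to the dates they share, then compare cell by cell.
common = feat_short.index.intersection(feat_long.index)
pd.testing.assert_frame_equal(feat_short.loc[common], feat_long.loc[common])
print("features identical on the common range")
```

Since the inputs match cell for cell while the outputs diverge, the discrepancy must be introduced somewhere between feature fetching and model output.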
📊 Additional Context
Configuration Example:
```yaml
dataset:
  class: TSDatasetH
  kwargs:
    handler:
      class: Alpha158
      kwargs:
        start_time: "2008-01-01"
        end_time: "2025-12-31"
        fit_start_time: "2000-01-01"  # Fixed normalization range
        fit_end_time: "2020-12-31"    # Fixed normalization range
        instruments: "all"
        infer_processors:
          - class: RobustZScoreNorm
            kwargs:
              fit_start_time: "2000-01-01"
              fit_end_time: "2020-12-31"
              fields_group: feature
          - class: Fillna
        learn_processors: []  # Disabled for inference
    segments:
      train: ["2000-01-01", "2020-12-31"]
      valid: ["2021-01-01", "2023-12-31"]
      test: ["2024-01-01", "2025-12-31"]  # Changing this affects predictions!
    step_len: 60
```

🙋 Questions
- Is this behavior intentional or a bug?
- Are there any known dependencies between test segment range and prediction scores?
- What is the recommended approach for ensuring backtest reliability?
🔗 Related
- Framework: Qlib
- Component: `TSDatasetH`, data loading pipeline
- Use case: Backtesting, model validation
Environment:
- Qlib version: [latest]
- Python version: 3.10
- PyTorch version: 2.2.0
- OS: Linux
Thank you for your attention to this issue! 🙏