
Prediction scores change when test segment range changes, causing backtest unreliability #2080

@tycallen


🐛 Bug Description

When running inference with a trained model for the same target date, the prediction score varies significantly depending on the test segment's end date in the dataset configuration. This makes backtesting results unreliable and inconsistent with live trading scenarios.

📋 Steps to Reproduce

Setup

  • Model: GATs (Graph Attention Networks) with LSTM base
  • Dataset: TSDatasetH with Alpha158 handler
  • Framework: Qlib (latest version)
  • Config: workflow_config_gats_Alpha158.yaml

Test Case

Scenario 1: Test segment = ("2024-01-01", "2025-12-05")

from qlib.utils import init_instance_by_config

# Restrict the test segment to end on 2025-12-05
dataset_config['kwargs']['segments']['test'] = ("2024-01-01", "2025-12-05")
dataset = init_instance_by_config(dataset_config)
predictions = model.predict(dataset)

# Check prediction for stock SH688270 on 2025-12-05
score_1 = predictions.loc[('2025-12-05', 'SH688270')]
# Result: -0.24687856

Scenario 2: Test segment = ("2024-01-01", "2025-12-31")

# Extend the test segment end date to 2025-12-31; everything else unchanged
dataset_config['kwargs']['segments']['test'] = ("2024-01-01", "2025-12-31")
dataset = init_instance_by_config(dataset_config)
predictions = model.predict(dataset)

# Check prediction for the SAME stock on the SAME date
score_2 = predictions.loc[('2025-12-05', 'SH688270')]
# Result: 0.59764999

Difference: |score_1 - score_2| = 0.8445 🔴

This is a huge difference for the exact same stock on the exact same date!
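
To see how widespread the drift is, here is a minimal sketch (assuming `model` and the `dataset_config` dict from above; the `predict_with_end` helper is illustrative) that compares the two runs across every overlapping (date, instrument) pair rather than a single one:

import copy
from qlib.utils import init_instance_by_config

def predict_with_end(end_date):
    # Rebuild the dataset with a different test end date and re-run inference
    cfg = copy.deepcopy(dataset_config)
    cfg['kwargs']['segments']['test'] = ("2024-01-01", end_date)
    return model.predict(init_instance_by_config(cfg))

pred_a = predict_with_end("2025-12-05")
pred_b = predict_with_end("2025-12-31")

# Align on the (datetime, instrument) pairs present in both runs
common = pred_a.index.intersection(pred_b.index)
diff = (pred_a.loc[common] - pred_b.loc[common]).abs()
print(diff.describe())            # distribution of per-sample drift
print(diff.sort_values().tail())  # largest deviations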

🎯 Expected Behavior

When predicting for date T, the prediction should be invariant to the test segment's end date, as long as:

  1. The end date is >= T (sufficient data available)
  2. The feature values for date T are identical
  3. The model is using the same weights

Rationale: In live trading on 2025-12-05, we only have data up to that date. The backtest should simulate this exact scenario to provide reliable performance metrics.
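
In code, the expectation amounts to something like the sketch below (the `predict_as_of` helper is illustrative, not an existing Qlib API):

import copy
from qlib.utils import init_instance_by_config

def predict_as_of(trade_date, base_config, model):
    # Score `trade_date` using only data available up to that date,
    # exactly as a live-trading run on that day would
    cfg = copy.deepcopy(base_config)
    cfg['kwargs']['segments']['test'] = ("2024-01-01", trade_date)
    dataset = init_instance_by_config(cfg)
    return model.predict(dataset).loc[trade_date]

# Invariance requirement: this score should equal the score for
# 2025-12-05 taken from any test segment whose end date is >= that day
live_like = predict_as_of("2025-12-05", dataset_config, model)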

❌ Actual Behavior

The prediction for date T changes when the test segment's end date changes, even though:

  • ✅ Feature values are identical (verified)
  • ✅ Label values are identical (verified)
  • ✅ Sample availability is identical (verified)
  • ✅ Normalization parameters are identical (verified with fit_start_time and fit_end_time)

This suggests the test segment range affects some internal state or processing in TSDatasetH or the data loading pipeline.

🔍 Investigation Results

What We Verified

  1. Handler-level features (via handler.fetch(), sketch after this list): ✅ Identical
  2. Handler normalization (RobustZScoreNorm with fixed fit_start_time/fit_end_time): ✅ Identical
  3. Sample availability on target date: ✅ Identical (same stocks)
  4. Learn processors (like CSRankNorm): ✅ Disabled for inference
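
A minimal sketch of how step 1 can be checked (assuming `dataset_a`/`dataset_b` are built from the two scenario configs; `DataHandlerLP.DK_I` selects the inference-time data):

from qlib.data.dataset.handler import DataHandlerLP

feat_a = dataset_a.handler.fetch(col_set="feature", data_key=DataHandlerLP.DK_I)
feat_b = dataset_b.handler.fetch(col_set="feature", data_key=DataHandlerLP.DK_I)

# Compare only the rows both configurations produced
common = feat_a.index.intersection(feat_b.index)
assert feat_a.loc[common].equals(feat_b.loc[common]), "handler features differ"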

📊 Additional Context

Configuration Example:

dataset:
  class: TSDatasetH
  kwargs:
    handler:
      class: Alpha158
      kwargs:
        start_time: "2008-01-01"
        end_time: "2025-12-31"
        fit_start_time: "2000-01-01"  # Fixed normalization range
        fit_end_time: "2020-12-31"    # Fixed normalization range
        instruments: "all"
        infer_processors:
          - class: RobustZScoreNorm
            kwargs:
              fit_start_time: "2000-01-01"
              fit_end_time: "2020-12-31"
              fields_group: feature
          - class: Fillna
        learn_processors: []  # Disabled for inference
    segments:
      train: ["2000-01-01", "2020-12-31"]
      valid: ["2021-01-01", "2023-12-31"]
      test: ["2024-01-01", "2025-12-31"]  # Changing this affects predictions!
    step_len: 60
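
For anyone reproducing this, a short sketch (assuming `dataset` is built from the config above) of inspecting what TSDatasetH actually feeds the model; each sample is a step_len-day rolling window ending on the target date:

sampler = dataset.prepare("test")  # TSDatasetH returns a TSDataSampler

idx = sampler.get_index()          # one (datetime, instrument) pair per sample
print(len(sampler), idx[:3])

sample = sampler[0]                # ndarray of shape (step_len, n_columns)
print(sample.shape)                # e.g. first dim is 60 with step_len: 60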

🙋 Questions

  1. Is this behavior intentional or a bug?
  2. Are there any known dependencies between test segment range and prediction scores?
  3. What is the recommended approach for ensuring backtest reliability?

🔗 Related

  • Framework: Qlib
  • Component: TSDatasetH, data loading pipeline
  • Use case: Backtesting, model validation

Environment:

  • Qlib version: [latest]
  • Python version: 3.10
  • PyTorch version: 2.2.0
  • OS: Linux

Thank you for your attention to this issue! 🙏
