
Prediction scores change when test segment range changes, causing backtest unreliability #2080

@tycallen


🐛 Bug Description

When running inference with a trained model for the same target date, the prediction score varies significantly depending on the test segment's end date in the dataset configuration. This makes backtesting results unreliable and inconsistent with live trading scenarios.

📋 Steps to Reproduce

Setup

  • Model: GATs (Graph Attention Networks) with LSTM base
  • Dataset: TSDatasetH with Alpha158 handler
  • Framework: Qlib (latest version)
  • Config: workflow_config_gats_Alpha158.yaml

Test Case

Scenario 1: Test segment = ("2024-01-01", "2025-12-05")

from qlib.utils import init_instance_by_config

# Restrict the test segment to end on 2025-12-05
dataset_config['kwargs']['segments']['test'] = ("2024-01-01", "2025-12-05")
dataset = init_instance_by_config(dataset_config)
predictions = model.predict(dataset)

# Check prediction for stock SH688270 on 2025-12-05
score_1 = predictions.loc[('2025-12-05', 'SH688270')]
# Result: -0.24687856

Scenario 2: Test segment = ("2024-01-01", "2025-12-31")

# Extend the test segment end date to 2025-12-31; everything else unchanged
dataset_config['kwargs']['segments']['test'] = ("2024-01-01", "2025-12-31")
dataset = init_instance_by_config(dataset_config)
predictions = model.predict(dataset)

# Check prediction for the SAME stock on the SAME date
score_2 = predictions.loc[('2025-12-05', 'SH688270')]
# Result: 0.59764999

Difference: |score_1 - score_2| = 0.8445 🔴

This is a huge difference for the exact same stock on the exact same date!
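
To see how widespread the drift is, here is a minimal sketch (assuming `model` and the `dataset_config` dict from above; the `predict_with_end` helper is illustrative) that compares the two runs across every overlapping (date, instrument) pair rather than a single one:

import copy
from qlib.utils import init_instance_by_config

def predict_with_end(end_date):
    # Rebuild the dataset with a different test end date and re-run inference
    cfg = copy.deepcopy(dataset_config)
    cfg['kwargs']['segments']['test'] = ("2024-01-01", end_date)
    return model.predict(init_instance_by_config(cfg))

pred_a = predict_with_end("2025-12-05")
pred_b = predict_with_end("2025-12-31")

# Align on the (datetime, instrument) pairs present in both runs
common = pred_a.index.intersection(pred_b.index)
diff = (pred_a.loc[common] - pred_b.loc[common]).abs()
print(diff.describe())            # distribution of per-sample drift
print(diff.sort_values().tail())  # largest deviations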

🎯 Expected Behavior

When predicting for date T, the prediction should be invariant to the test segment's end date, as long as:

  1. The end date is >= T (sufficient data available)
  2. The feature values for date T are identical
  3. The model is using the same weights

Rationale: In live trading on 2025-12-05, we only have data up to that date. The backtest should simulate this exact scenario to provide reliable performance metrics.
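
In code, the expectation amounts to something like the sketch below (the `predict_as_of` helper is illustrative, not an existing Qlib API):

import copy
from qlib.utils import init_instance_by_config

def predict_as_of(trade_date, base_config, model):
    # Score `trade_date` using only data available up to that date,
    # exactly as a live-trading run on that day would
    cfg = copy.deepcopy(base_config)
    cfg['kwargs']['segments']['test'] = ("2024-01-01", trade_date)
    dataset = init_instance_by_config(cfg)
    return model.predict(dataset).loc[trade_date]

# Invariance requirement: this score should equal the score for
# 2025-12-05 taken from any test segment whose end date is >= that day
live_like = predict_as_of("2025-12-05", dataset_config, model)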

❌ Actual Behavior

The prediction for date T changes when the test segment's end date changes, even though:

  • ✅ Feature values are identical (verified)
  • ✅ Label values are identical (verified)
  • ✅ Sample availability is identical (verified)
  • ✅ Normalization parameters are identical (verified with fit_start_time and fit_end_time)

This suggests the test segment range affects some internal state or processing in TSDatasetH or the data loading pipeline.

🔍 Investigation Results

What We Verified

  1. Handler-level features (via handler.fetch(), sketch after this list): ✅ Identical
  2. Handler normalization (RobustZScoreNorm with fixed fit_start_time/fit_end_time): ✅ Identical
  3. Sample availability on target date: ✅ Identical (same stocks)
  4. Learn processors (like CSRankNorm): ✅ Disabled for inference
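
A minimal sketch of how step 1 can be checked (assuming `dataset_a`/`dataset_b` are built from the two scenario configs; `DataHandlerLP.DK_I` selects the inference-time data):

from qlib.data.dataset.handler import DataHandlerLP

feat_a = dataset_a.handler.fetch(col_set="feature", data_key=DataHandlerLP.DK_I)
feat_b = dataset_b.handler.fetch(col_set="feature", data_key=DataHandlerLP.DK_I)

# Compare only the rows both configurations produced
common = feat_a.index.intersection(feat_b.index)
assert feat_a.loc[common].equals(feat_b.loc[common]), "handler features differ"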

📊 Additional Context

Configuration Example:

dataset:
  class: TSDatasetH
  kwargs:
    handler:
      class: Alpha158
      kwargs:
        start_time: "2008-01-01"
        end_time: "2025-12-31"
        fit_start_time: "2000-01-01"  # Fixed normalization range
        fit_end_time: "2020-12-31"    # Fixed normalization range
        instruments: "all"
        infer_processors:
          - class: RobustZScoreNorm
            kwargs:
              fit_start_time: "2000-01-01"
              fit_end_time: "2020-12-31"
              fields_group: feature
          - class: Fillna
        learn_processors: []  # Disabled for inference
    segments:
      train: ["2000-01-01", "2020-12-31"]
      valid: ["2021-01-01", "2023-12-31"]
      test: ["2024-01-01", "2025-12-31"]  # Changing this affects predictions!
    step_len: 60
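
For anyone reproducing this, a short sketch (assuming `dataset` is built from the config above) of inspecting what TSDatasetH actually feeds the model; each sample is a step_len-day rolling window ending on the target date:

sampler = dataset.prepare("test")  # TSDatasetH returns a TSDataSampler

idx = sampler.get_index()          # one (datetime, instrument) pair per sample
print(len(sampler), idx[:3])

sample = sampler[0]                # ndarray of shape (step_len, n_columns)
print(sample.shape)                # e.g. first dim is 60 with step_len: 60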

🙋 Questions

  1. Is this behavior intentional or a bug?
  2. Are there any known dependencies between test segment range and prediction scores?
  3. What is the recommended approach for ensuring backtest reliability?

🔗 Related

  • Framework: Qlib
  • Component: TSDatasetH, data loading pipeline
  • Use case: Backtesting, model validation

Environment:

  • Qlib version: [latest]
  • Python version: 3.10
  • PyTorch version: 2.2.0
  • OS: Linux

Thank you for your attention to this issue! 🙏
