Description
Identified an issue in the evaluation pipeline that leads to significantly degraded evaluation results when float16 is used together with flash_attention_2.
Problem
The evaluation code converts the model to float16 after loading the checkpoint and enables flash_attention_2 (lines 147-148 of src/modernvbert/contrastive_training/evaluate.py). This does not work correctly for this checkpoint and produces very poor evaluation results.
Key points:
- The checkpoint parameters are stored in float32.
- During evaluation, the model is:
  - loaded in float32,
  - then converted to float16,
  - and evaluated with flash_attention_2.
- This configuration results in incorrect scores.
- When the checkpoint is evaluated without forcing float16 and FlashAttention, the results are significantly better and consistent.
- Training emits warnings indicating that FlashAttention2 requires float16, suggesting it was not correctly enabled during training.
- This implies the model was not trained with FlashAttention2 in float16.
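To see why the cast alone can hurt, here is a minimal sketch (using numpy; not part of the repository) of the precision and range that float16 loses relative to float32:

```python
import numpy as np

# float16 has a 10-bit mantissa: integers above 2048 are no longer exact.
x = np.float32(2049.0)
print(np.float16(x))  # rounds to 2048.0

# float16 also underflows far earlier than float32:
# values below the smallest subnormal (~6e-8) collapse to 0.
eps = np.float32(1e-8)
print(np.float16(eps))  # 0.0

# Weights trained in float32 therefore lose information when cast down,
# which can shift activations enough to degrade evaluation scores.
np.random.seed(0)
w = np.random.randn(1000).astype(np.float32) * 1e-2
err = np.abs(w - w.astype(np.float16).astype(np.float32))
print(err.max() > 0)  # the round-trip cast is lossy
```

This is only an illustration of the dtype mismatch; the actual degradation comes from running a float32-trained checkpoint through this lossy cast at evaluation time.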
Suggested Fixes
- Do not convert the model to float16 after loading if the checkpoint was trained in float32.
- Only enable flash_attention_2 when the model was trained and stored in a compatible precision.
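One possible shape for the fix, as a sketch only: derive the loading arguments from the checkpoint's stored dtype instead of forcing float16. The helper below is hypothetical (not from the repository), though `torch_dtype` and `attn_implementation` are the real `from_pretrained` arguments in transformers.

```python
# Hypothetical helper: choose from_pretrained kwargs from the dtype the
# checkpoint was actually trained and stored in, instead of forcing float16.
def select_load_kwargs(checkpoint_dtype: str) -> dict:
    if checkpoint_dtype in ("float16", "bfloat16"):
        # FlashAttention2 only supports half-precision inputs.
        return {"torch_dtype": checkpoint_dtype,
                "attn_implementation": "flash_attention_2"}
    # float32 checkpoint: keep full precision and fall back to SDPA.
    return {"torch_dtype": checkpoint_dtype,
            "attn_implementation": "sdpa"}

# Usage sketch (the transformers call itself is not executed here):
# model = AutoModel.from_pretrained(ckpt_path, **select_load_kwargs("float32"))
print(select_load_kwargs("float32")["attn_implementation"])  # sdpa
print(select_load_kwargs("float16")["attn_implementation"])  # flash_attention_2
```

With this gating, a float32 checkpoint is evaluated in float32 without FlashAttention, matching the configuration that already produces good results.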