From my understanding, there are two parts to it:
- LLM-as-a-judge, to generate a reward score. As a first step, evaluate whether this strategy can help during re-generation requests with exhaustive beam search.
- Self-training/modification on preference pairs
#Reference: https://arxiv.org/abs/2401.10020
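
A minimal sketch of how the two parts could fit together, under stated assumptions: the judge is passed in as a plain function that returns a numeric score, and `toy_judge` is only a hypothetical stand-in for a real LLM call. The names `rank_candidates` and `build_preference_pair` are illustrative and do not come from the paper's code. The first function covers the re-generation case (rerank beam candidates by judge score); the second turns the same scores into one chosen/rejected pair for preference training.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: a judge maps (prompt, response) to a numeric reward score.
# In practice this would wrap an LLM call with a grading prompt.
Judge = Callable[[str, str], float]


def rank_candidates(prompt: str, candidates: List[str], judge: Judge) -> List[Tuple[float, str]]:
    """Score every candidate with the judge and sort best-first."""
    scored = [(judge(prompt, c), c) for c in candidates]
    return sorted(scored, key=lambda sc: sc[0], reverse=True)


def build_preference_pair(
    prompt: str, candidates: List[str], judge: Judge
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from the best and worst scored candidates.

    Returns None when there is no usable signal (fewer than two
    candidates, or the best and worst scores are equal).
    """
    ranked = rank_candidates(prompt, candidates, judge)
    if len(ranked) < 2:
        return None
    best_score, best = ranked[0]
    worst_score, worst = ranked[-1]
    if best_score == worst_score:
        return None
    return best, worst


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical placeholder judge: counts distinct words in the response."""
    return float(len(set(response.lower().split())))
```

For re-generation, the candidates would be the sequences returned by beam search, and the top entry of `rank_candidates` is the one to serve. For self-training, the pairs from `build_preference_pair` would be collected across prompts and fed to a preference-optimization step.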