Have we found the baselines to be reproducible?

<img width="1661" height="328" alt="Image" src="https://github.com/user-attachments/assets/632b1803-d49b-462b-94ea-8f5584a08fda" /> 

Running the baseline `ippo_ff_mpe.py` gives me pretty terrible returns: training converges at around -60 reward. I'm running it on simple_spread, where the reward is essentially the negative distance between agents and landmarks. Are there recorded baseline metrics I can compare against?
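
For context, here is a minimal sketch of how I understand the simple_spread reward, so it's clear what the number above measures. This is an assumption on my part rather than the repo's actual code: `spread_reward` is a hypothetical helper, the positions are made up, and it ignores the collision penalty that I believe the real environment also applies.

```python
import math

def spread_reward(agent_positions, landmark_positions):
    """Per-step shared reward: for each landmark, take the distance to the
    closest agent, sum those distances, and negate the total.

    Hypothetical helper; collision penalties are deliberately left out."""
    total = 0.0
    for lx, ly in landmark_positions:
        closest = min(math.hypot(ax - lx, ay - ly) for ax, ay in agent_positions)
        total += closest
    return -total

# Made-up positions, only to show the shape of the computation.
agents = [(0.0, 0.0), (2.0, 0.0)]
landmarks = [(0.0, 0.0), (2.0, 1.0), (0.0, -3.0)]
print(spread_reward(agents, landmarks))
```

If that reading is right, the reward can never be positive, and the total depends on how many steps it is summed over, which is why a reference number to compare against would help.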