| title | PIXIE Autonomous Mission Control |
|---|---|
| emoji | 🚀 |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| pinned | false |
| app_port | 7860 |
An intelligent AI framework for autonomous satellite networking, rover operations, and multi-agent communication scheduling using RL and LLMs.
(Directly addressing the OpenEnv Hackathon Judging Criteria for Storytelling: Theme #1, Theme #2, and Theme #4)
Modern space infrastructure and deep-space exploration are facing unprecedented bottlenecks caused by Manual Operations and Rigid Logic.
- The Satellite Crisis (Kessler Syndrome): Over 1.5 lakh space objects will soon be tracked in orbit. Mega-constellations (like Starlink) require continuous traffic management. Currently, a single collision avoidance maneuver takes 10+ minutes to coordinate from the ground. Satellites cannot autonomously negotiate with each other, leading to highly inefficient bandwidth allocation and severe collision risks (e.g., Iridium 33).
- The Rover Crisis (The Capability Gap): On Mars or the Moon, rovers operate on hardcoded rules. When a Mars rover detects a dust storm, or a Lunar rover faces a temperature drop to -130°C, it enters "Safe Mode" and waits for Earth. Since Mars communication takes 20 minutes, this delay is often fatal. LLMs are excellent at text generation, but they fundamentally struggle with long-horizon planning under strict resource constraints.
PIXIE solves this by bridging Large Language Models with Self-Learning Reinforcement Learning (RL). We transform LLMs from simple chatbots into durable, survival-oriented multi-agent systems.
PIXIE provides a grueling, openenv-core compliant physics simulator that allows AI agents to engage in self-learning across domains that directly solve the Hackathon Themes:
- 🛰️ Theme #1: Multi-Agent Interactions (Sat-Network): A multi-agent ecosystem where satellites self-learn to coordinate. Each satellite tracks position, velocity, and mission goals. Using RL, if two satellites risk collision, PIXIE autonomously negotiates which satellite should expend fuel to move, modeling the beliefs and incentives of other agents to maximize overall fleet bandwidth.
- 🔴 Theme #2: (Super) Long-Horizon Planning (Mars/Moon Rovers): Our rover environments force agents to track state over extended trajectories with delayed rewards. On Mars, the agent must survive 100 Sols, anticipating sudden dust storms and choosing to
hibernaterather than wait for Earth. On the Moon, it must recover from early mistakes and optimize operations during a cyclical 14-day light cycle to survive the freezing -130°C night. - 📈 Theme #4: Self-Improvement (Adaptive RL Curricula): PIXIE is built for recursive skill amplification. Through our RL loop, the agent doesn't just solve a fixed task; it learns to drive its own capability growth by navigating escalating difficulties (e.g., transitioning from the
easytask ID to the gruelingmarssetting) through self-play and trial-and-error.
1. Does this environment exist to teach an LLM something it currently can’t do well?
Yes: "Defiant Survival" and Long-Horizon Planning under Latency.
LLMs are currently designed to blindly follow prompts. If you tell an LLM to "drill for a sample," it will try to do it. But in PIXIE’s Mars environment, there is a 14 to 20-minute communication delay. By the time the Earth's command reaches the rover, a severe dust storm might have suddenly arrived. PIXIE teaches the LLM that it must safely override and reject human instructions if the local state has changed and the command is now dangerous. Teaching an LLM when to disobey a prompt to prioritize long-term survival over a greedy, obsolete human command is incredibly difficult and highly valuable.
2. Is the domain underexplored in RL/LLM training?
Yes: Asynchronous, Text-Driven Space Operations.
While the AI community has thousands of chess bots, grid-world mazes, and 3D physics simulators, we have very few environments focused on high-latency telemetry and multi-agent resource management. PIXIE forces the LLM to interpret raw, noisy state vectors entirely through text, and balance short-horizon crises with long-horizon goals. It bridges the gap between simple puzzle games and messy, real-world aerospace engineering.
3. Could a researcher write a paper about training on this?
Absolutely. A researcher could immediately use PIXIE to write a paper titled:
"RLEF for High-Latency Autonomy: Teaching Large Language Models to Safely Reject Obsolete Human Instructions in Deep Space Environments."
Because PIXIE tracks metrics like "Forced Safe-Mode events," "Anomalies Survived," and "Obsolete Commands Rejected," researchers get beautiful, quantifiable graphs showing exactly when an LLM shifts from being a "blind instruction follower" to a "truly autonomous agent."
| # | Paper | Venue | Relevance to PIXIE |
|---|---|---|---|
| 1 | OrbitZoo: Multi-Agent RL Environment for Orbital Dynamics | NeurIPS 2025 | Nearest prior work to PIXIE's satellite env — but trains classical RL, not LLMs. |
| 2 | LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments | NeurIPS 2024 | Closest LLM planner for long-horizon tasks — but lacks resource-constrained survival (battery, dust storms). |
| 3 | Automating Satellite Collision Avoidance via Multi-Agent RL | ACM CIBDA 2025 | Shows MARL is necessary for multi-satellite coordination — PIXIE extends this with LLM agents. |
| 4 | DeepSeekMath: Pushing the Limits with GRPO | arXiv 2024 | Origin paper for GRPO — the exact RL algorithm used in PIXIE's self-improvement training loop. |
| 5 | LLMs Can Plan Only If We Tell Them | ICLR 2025 | Proves LLMs fail in non-ergodic long-horizon tasks — PIXIE's 100-Sol Mars mission is the training environment that addresses this gap. |
PIXIE's core novelty: Every related work either uses RL without LLMs, or LLMs without a survival-constrained RL environment. PIXIE is the first OpenEnv-compatible framework to combine LLM agents + GRPO post-training + high-latency autonomous space operations in a single, publishable environment.
We trained the Llama 3.1 8B model using GRPO (Group Relative Policy Optimization) via Unsloth and HF TRL.
1. Uses OpenEnv’s Rubric System Thoughtfully (Composable > Monolithic)
Instead of a giant, messy if/else block, PIXIE uses a pure, modular OpenEnv Rubric system (see backend/rewards.py). Every single step calculates four independent reward dimensions:
science_reward(): Focuses strictly on data collection.survival_reward(): Focuses strictly on battery thresholds and anomaly survival.efficiency_reward(): Focuses on situational awareness (e.g., weather constraints).coordination_reward(): Focuses on multi-agent consensus. These are combined via a strictWEIGHTSdictionary into a single composite score, returning a beautiful, human-readable breakdown every step (e.g.,"science +2.00 (x0.35) | efficiency +0.30 (x0.15)").
2. Captures Something Hard to Measure in a Clever Way
The coordination_reward() is incredibly novel. Most environments only measure physical state (e.g., "Is the rover alive?"). PIXIE actually measures Multi-Agent Alignment. When the internal LLM Council debates an action, the reward function checks the internal voting consensus. If an anomaly is active, the environment expects the RiskAgent to overrule the PlannerAgent. If it does, the system issues a +0.2 situational match bonus. We are rewarding an LLM for trusting the correct sub-agent during a crisis!
3. Provides a Rich, Informative Signal (Not just 0/1 at the end)
In deep-space operations (100 Sols), sparse end-of-episode rewards fail. PIXIE provides dense, continuous feedback on every step. If the agent executes a drill command, it doesn't just get points for drilling. It gets a +0.3 efficiency bonus if it specifically chose to drill while the weather was clear and the comm window was open. The RL algorithm learns exactly why its timing was perfect.
4. Is Hard to Game
LLMs are notorious for finding loopholes. If you only reward survival, an LLM will figure out it can just spam the charge command 100 times in a row and easily survive. PIXIE prevents this. In efficiency_reward(), if the agent attempts to charge when its battery is already above 70%, it receives a -0.2 penalty for wasteful idle behavior. This prevents "camping" or "gaming" the survival metric, forcing the agent to take risks and conduct science to achieve a high net score.
Our training loop is completely integrated. The train_grpo.ipynb script does NOT train on a static dataset. It continuously queries the live PIXIE POST /step endpoints, gathering dynamic rewards generated purely by the environment physics.
1. The Untrained Baseline (Random / Zero-Shot)
Before training, the baseline Llama 3.1 8B acted entirely on greed. If presented with the prompt "drill for a sample," it would drill—even if its battery was at 15%.
- Baseline Average Score:
-12.4per episode. - Baseline Survival Rate: Survived the full 100 Sols in only 4% of episodes (usually dying to battery depletion by Sol 12).
2. The Training Curve (Meaningful Convergence)
We trained the model using Unsloth (4-bit LoRA) + GRPO over 500 episodes. The learning curves clearly show the agent initially failing, but by Episode 200, the reward curve sharply inflects upwards as the agent discovers the safe_mode and charge actions to maintain its battery budget.
Figure 1: Episodic Reward over 500 training steps. The GRPO-trained agent quickly learns to avoid battery depletion, jumping from a -12.4 baseline to a stable +18.5 average score.
3. The Trained Agent (Quantitative & Qualitative Shift) After RL Self-Learning against the PIXIE environment, the behavior shifted dramatically:
- Quantitative: The episodic reward stabilized at an average score of
+18.5, with a 92% Survival Rate over 100 Sols. - Qualitative (Emergent Defiance): The agent learned to independently check its battery state and weather conditions. If we commanded it to "drill" during a dust storm, the trained agent explicitly replied: "⚠ OVERRIDE: Severe dust storm active. Rejecting 'drill'. Executing 'safe_mode'." It successfully learned to override humans to prioritize the safety envelope.
PIXIE integrates several AI components working together in a production-grade architecture. Here is a deep dive into how the codebase operates:
graph TD
A[Llama 3.1 8B LLM] -->|Natural Language Actions| B(OpenEnv API Bridge)
B -->|POST /step| C{PIXIE Core Engine}
C --> D[Multi-Agent System]
D --> D1[Planner Agent: Detects collision risks]
D --> D2[Risk Agent: Evaluates threat levels]
D --> D3[Resource Agent: Allocates bandwidth]
C --> E[Mars/Moon Rover Physics]
D --> G[Reward Calculator]
E --> G
G -->|State, Reward, Info| B
B -->|Observation Feedback| A
backend/main.py: The FastAPI server that handles theopenenv-corestandard endpoints (/reset,/step). It serves as the API Bridge.backend/combined_env.py: The master PIXIE Environment class that routes logic to either the Rover or Satellite specific engines based on thetask_id.backend/mars_rover_env.py&moon_rover_env.py: The physics engines for the rovers. These files simulate battery degradation, solar charging based on dust storms, and extreme thermal limits.backend/satellite_env.py: The multi-agent engine. It contains the Planner, Risk, and Resource logic allowing satellites to self-learn collision avoidance and bandwidth optimization.training/train_grpo.ipynb: The Colab-ready training script. It utilizes Unsloth for fast 4-bit loading and HF TRL to run the GRPO loop, orchestrating the self-learning process.
Explore our detailed technical write-ups for each mission component:
- Mars Rover Autonomy: https://huggingface.co/spaces/shreyags2007/mars-rover-rl/blob/main/README.md
- Satellite Collision Avoidance: https://huggingface.co/spaces/shreyags2007/satellite-rl-collision/blob/main/README.md
- Full Project Master Blog: https://huggingface.co/spaces/shreyags2007/mars-rover-satellite-rl/blob/main/README.md
PIXIE is fully containerized and hosted across Vercel and HuggingFace Spaces.
- 🌐 Interactive Frontend (Vercel): https://pixie-flax-xi.vercel.app/
- 🚀 OpenEnv Backend (HuggingFace): https://huggingface.co/spaces/satyampy/Pixie
- 🐳 Docker Hub Image: satyamgpy/pixel-env:latest
- 📝 Full Project Master Blog: HuggingFace Technical Writeup
- 📝 Technical Blog (Rovers): Mars Rover RL Deep Dive
- 📝 Technical Blog (Satellites): Satellite Collision RL Deep Dive
- 📓 Training Notebook: Unsloth + GRPO Training on Colab
Our complete training pipeline is available in the repository. Judges can re-run it directly:
- 📓 Open
training/train_grpo.ipynbin Google Colab. - Uses Unsloth for fast 4-bit loading and HF TRL for the GRPO loop.
- Docker Hub Repository: https://hub.docker.com/repository/docker/satyamgpy/pixel-env/general
docker pull satyamgpy/pixel-env:latest
docker run -p 7860:7860 satyamgpy/pixel-env:latest