
Release 0.3.0: WebSocket-based MARS client and Server #8

Open
gbiavati wants to merge 45 commits into main from websocketmars

gbiavati commented Feb 10, 2026

Summary

Introduces WebSocket-based MARS client architecture with backward-compatible modal behavior.

Default Behavior (100% Backward Compatible)

  • USE_SHARES=false (default) → Uses existing pipe-based client
  • No configuration changes required
  • All existing deployments continue to work unchanged
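
A minimal sketch of how the opt-in switch could be read, assuming the flag arrives as the `MARS_USE_SHARES` environment variable named under Migration. The accepted truthy spellings and the `"pipe"` / `"websocket"` mode names are illustrative assumptions, not the package's actual API.

```python
import os


def use_shares_enabled(environ=None):
    """Return True only when the flag is explicitly set to a truthy value."""
    env = os.environ if environ is None else environ
    raw = env.get("MARS_USE_SHARES", "false")  # unset means the default, false
    # Assumption: which spellings count as "true" is a guess for illustration.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def select_client_mode(environ=None):
    """Pick the client flavour; names are hypothetical labels for the two modes."""
    return "websocket" if use_shares_enabled(environ) else "pipe"
```

With an empty environment this returns `"pipe"`, which matches the default described above.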

New Features (Opt-In via USE_SHARES=true)

  • WebSocket client/server for shared filesystem deployments
  • Real-time log streaming (no HTTP polling overhead)
  • Connection pooling with automatic failover
  • Client-side log filtering with injectable custom handlers
  • Graceful process management (no orphaned processes)
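
The failover idea in the list above can be sketched as trying each configured server in order until one accepts the connection. This is only an illustration: `connect_with_failover` and the injected `connect` callable are hypothetical names, and the real client may order or retry servers differently.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(servers: Sequence[str], connect: Callable[[str], T]) -> T:
    """Return the first successful connection, trying servers in the given order."""
    errors = []
    for url in servers:
        try:
            return connect(url)
        except OSError as exc:
            # ConnectionRefusedError and TimeoutError are both OSError subclasses.
            errors.append(f"{url}: {exc}")
    raise ConnectionError("all servers failed: " + "; ".join(errors))
```

Passing the connect function in as an argument keeps the sketch independent of any particular WebSocket library.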

Why WebSocket instead of HTTP?

WebSocket is a better fit than HTTP polling for long-running MARS jobs:

  • Real-time bidirectional communication: Logs stream as they occur, no polling
  • Single persistent connection: Handles jobs running for minutes/hours
  • Lower latency: Immediate notification of completion/errors
  • Interactive control: Send commands (kill, etc.) without new connections
  • Efficient: No repeated status check requests consuming resources

HTTP polling would require periodic requests, delayed notifications, complex state management, and higher server load.
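
To make the polling overhead concrete, here is a back-of-the-envelope calculation. The job length and poll interval are illustrative numbers, not measurements from this project, and the mean delay assumes the job finishes at a uniformly random point between two polls.

```python
import math


def polling_cost(job_seconds: float, poll_interval: float):
    """Return (status requests needed, mean notification delay in seconds)."""
    requests = math.ceil(job_seconds / poll_interval)
    mean_delay = poll_interval / 2  # completion lands, on average, mid-interval
    return requests, mean_delay


# A one-hour job polled every 5 seconds:
print(polling_cost(3600, 5))  # → (720, 2.5)
```

A persistent WebSocket connection replaces those status requests with a single connection and pushes the completion event as it happens.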

Changes

See CHANGELOG.md

Added

  • WebSocket client/server (ws_client.py, ws_server.py)
  • Modal client selection via USE_SHARES config
  • Client-side log filtering with custom handler support
  • CephFS health diagnostics (check-cephfs-health)
  • Process group management for clean shutdown
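
Client-side log filtering with an injectable handler could look roughly like the sketch below. `dispatch_logs`, the `keep` predicate, and the DEBUG-dropping default are assumptions made for illustration; LOG_FILTERING.md describes the actual interface.

```python
from typing import Callable, Iterable

LogHandler = Callable[[str], None]


def dispatch_logs(
    lines: Iterable[str],
    handler: LogHandler,
    keep: Callable[[str], bool] = lambda line: "DEBUG" not in line,
) -> None:
    """Forward each streamed log line that passes the filter to the caller's handler."""
    for line in lines:
        if keep(line):
            handler(line)
```

Any callable that accepts a string works as the handler, for example `print` or the `append` method of a list used to collect lines.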

Fixed

  • Orphaned process accumulation during restarts
  • Graceful shutdown with SIGTERM/SIGINT handlers

Documentation

  • Comprehensive README with both modes documented
  • LOG_FILTERING.md with custom handler examples
  • CEPHFS_ARCHITECTURE.md for troubleshooting

Breaking Changes

  • Applications using USE_SHARES=true must upgrade to cads-mars-server>=0.3.0
  • Shared filesystem required for WebSocket mode

Migration

No action required for existing deployments. To opt-in to WebSocket mode:

  1. Set MARS_USE_SHARES=true
  2. Configure WebSocket servers: MARS_WS_SERVERS="ws://server1:9001,..."
  3. Start WebSocket servers: ws-mars-server --host 0.0.0.0 --port 9001

See README.md for details.
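
For illustration, the comma-separated server list from step 2 could be parsed and validated as below. `parse_ws_servers` is a hypothetical helper, and `server2` in the usage note is a made-up host. The real client may parse the variable differently.

```python
from urllib.parse import urlparse


def parse_ws_servers(raw: str):
    """Split a comma-separated list of WebSocket URLs, rejecting malformed entries."""
    servers = []
    for item in raw.split(","):
        url = item.strip()
        if not url:
            continue  # tolerate trailing commas and stray whitespace
        parsed = urlparse(url)
        if parsed.scheme not in ("ws", "wss") or not parsed.hostname:
            raise ValueError(f"not a WebSocket URL: {url!r}")
        servers.append(url)
    return servers
```

For example, `parse_ws_servers("ws://server1:9001, ws://server2:9001")` yields a two-element list, while an `http://` entry raises `ValueError`.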

- Added detailed explanations of CephFS architecture and components to the health check script.
- Updated health check script to clarify OSD connection issues and their implications.
- Improved logging messages to better inform users about OSD and MDS issues.
- Suggested actions for users to take when encountering OSD-related problems.
- Add CHANGELOG.md with WebSocket vs HTTP rationale
- Update README.md with configuration and usage guide
- Document log filtering and CephFS diagnostics
- Updated CephFS architecture documentation for clarity on monitor, metadata server, and object storage daemon roles.
- Improved log filtering in the WebSocket client to enhance user experience by reducing noise and highlighting important messages.
- Added examples of custom log handlers for MARS requests to demonstrate advanced logging capabilities.
- Adjusted project metadata in pyproject.toml to reflect the current development status and Python version requirements.
- Enhanced the check_cephfs_health script to provide clearer error messages and suggestions for resolving OSD issues.
- Refined test_config.py to improve output formatting and ensure accurate configuration precedence verification.
gbiavati changed the title from "Release 3.0.0: WebSocket-based MARS client and Server" to "Release 0.3.0: WebSocket-based MARS client and Server" on Feb 10, 2026