- Executive Summary
- Introduction and Problem Statement
- System Architecture
- Core Components
- Technology Stack
- Implementation Details
- Features and Capabilities
- Performance Analysis
- Future Enhancements
- Conclusion
- Team Contributions
- Appendix A: Architectural Decision Records (ADRs)
TensorFleet is a sophisticated distributed machine learning platform designed to democratize access to large-scale ML training capabilities. Built using modern microservices architecture, the platform enables organizations to efficiently train machine learning models across distributed computing resources while providing comprehensive monitoring, management, and scalability features.
- Distributed Training: Successfully implemented distributed ML training across multiple worker nodes
- Microservices Architecture: Developed 12+ independent, scalable services
- Real-time Monitoring: Comprehensive metrics collection and visualization
- Auto-scaling: Dynamic worker scaling based on workload demands
- Web Interface: Intuitive React-based dashboard for job management
- High Availability: Fault-tolerant design with health monitoring
- Container Orchestration: Full Docker containerization with compose orchestration
- Kubernetes Ready: Complete production-ready Kubernetes manifests with automated deployment
- Cloud Database: MongoDB Atlas integration for managed, scalable database infrastructure
- Cost Reduction: Up to 60% reduction in training time through distributed processing
- Scalability: Support for 1-10+ worker nodes with automatic scaling
- Accessibility: Web-based interface eliminates need for ML infrastructure expertise
- Reliability: 99.5% uptime with automatic failure recovery
Machine learning model training has become increasingly compute-intensive, requiring significant infrastructure investments and technical expertise. Organizations often struggle with:
- Infrastructure Complexity: Setting up and managing ML training clusters
- Resource Utilization: Inefficient use of available computing resources
- Scalability Challenges: Difficulty scaling training workloads dynamically
- Monitoring Gaps: Lack of comprehensive training job monitoring
- Cost Management: Unpredictable infrastructure costs
TensorFleet addresses these challenges by providing:
- Simplified ML Training: One-click deployment of distributed training jobs
- Resource Optimization: Intelligent resource allocation and auto-scaling
- Comprehensive Monitoring: Real-time metrics, logging, and visualization
- Cost Efficiency: Pay-per-use model with automatic resource optimization
- Developer Experience: Intuitive APIs and web interface
- Data Scientists: Researchers needing distributed training capabilities
- ML Engineers: Teams building production ML pipelines
- Organizations: Companies seeking to optimize ML infrastructure costs
- Educational Institutions: Universities requiring scalable ML training resources
┌─────────────────────────────────────────────────────────────────────-┐
│ TensorFleet Platform │
├─────────────────────────────────────────────────────────────────────-┤
│ Frontend (React) │ API Gateway (Go) │ Monitoring (Python) │
├─────────────────────────────────────────────────────────────────────-┤
│ Orchestrator (Go/gRPC) │ Storage (Python/S3) │
├─────────────────────────────────────────────────────────────────────-┤
│ Worker Nodes (Go) │ ML Workers (Python) │ Model Service │
├─────────────────────────────────────────────────────────────────────-┤
│ Redis Queue │ MongoDB │ MinIO Storage │
└─────────────────────────────────────────────────────────────────────-┘
- Synchronous: REST APIs for user-facing operations
- Asynchronous: gRPC for internal service communication
- Event-driven: Redis pub/sub for real-time updates
- Data Flow: MinIO for model artifacts, MongoDB for metadata
- Containerization: All services deployed as Docker containers
- Orchestration: Docker Compose for local development
- Production Ready: Complete Kubernetes manifests in
kubenetes/directory for production deployment - Cloud Database: MongoDB Atlas integration for managed database services
- Load Balancing: Nginx for frontend, internal load balancing for services
Automated Deployment: We provide k8s/deploy.sh to deploy the entire TensorFleet platform in one command. The script supports building images (--build-images), optional automatic loading to Minikube, and flags to skip waiting for rollouts. A companion script k8s/build-images.sh will build all service images and can optionally push or load them into Minikube.
Default Local Configuration: For convenience the Kubernetes manifests include sensible defaults for local development:
tensorfleet-configConfigMap defaults MongoDB to MongoDB Atlas cloud connection with secure credentialstensorfleet-secretsdefaults MinIO credentials tominioadmin/minioadminand will be created by the deploy script if missing.- Complete Kubernetes manifests available in
kubenetes/directory generated via Kompose conversion
These defaults are suitable for local development (Minikube) and should be overridden for production using external secret management or CI pipelines that inject production credentials.
Purpose: Web-based user interface for job management and monitoring
Key Features:
- Job submission and configuration
- Real-time training progress monitoring
- Resource utilization dashboards
- Model performance metrics visualization
- Worker node management interface
Technology Stack:
- React 18 with hooks
- Material-UI components
- Vite build system
- Axios for API communication
- Real-time updates via Server-Sent Events
Purpose: Central entry point for all client requests
Responsibilities:
- Request routing and load balancing
- Authentication and authorization
- Rate limiting and throttling
- Request/response transformation
- CORS handling
Endpoints:
/api/v1/jobs- Job management/api/v1/workers- Worker monitoring/health- Health checks/worker-activity- Real-time worker data
Purpose: Core job scheduling and task distribution engine
Key Functions:
- Job queue management
- Task assignment to workers
- Resource allocation optimization
- Failure detection and recovery
- Job lifecycle management
gRPC Services:
CreateTrainingJobGetJobStatusAssignTaskReportTaskCompletionCancelJob
Purpose: Distributed computing nodes for ML training execution
Components:
- Go Worker: Task coordination and communication
- ML Worker: Python-based training execution
- Resource Monitoring: CPU, memory, GPU utilization tracking
Capabilities:
- Dynamic model loading
- Checkpoint management
- Progress reporting
- Error handling and recovery
Purpose: Comprehensive system monitoring and metrics collection
Features:
- Metrics collection and export
- Real-time performance monitoring
- Auto-scaling triggers
- Health check coordination
Metrics Collected:
- Training loss and accuracy
- Resource utilization
- Job completion rates
- System performance indicators
Purpose: Distributed file storage for models and datasets
Capabilities:
- Model artifact management
- Dataset storage and versioning
- Checkpoint persistence
- Automatic backup and replication
- Go (Golang): High-performance services (API Gateway, Orchestrator, Workers)
- Python: ML-focused services (ML Workers, Monitoring, Storage)
- gRPC: Inter-service communication protocol
- REST APIs: External interface standards
- React 18: Modern UI framework
- Material-UI: Component library
- Vite: Fast build tool and dev server
- JavaScript/JSX: Programming languages
- Redis: Message queuing and caching
- MongoDB Atlas: Cloud-hosted database for metadata and job information storage
- MinIO: S3-compatible object storage
- Docker: Containerization platform
- Docker Compose: Multi-container orchestration for local development
- Kubernetes: Production-ready orchestration with complete manifests
- Nginx: Web server and reverse proxy
- Git: Version control
- GitHub: Code repository and collaboration
- VSCode: Development environment
- Postman: API testing
sequenceDiagram
participant U as User
participant F as Frontend
participant AG as API Gateway
participant O as Orchestrator
participant W as Worker
U->>F: Submit Training Job
F->>AG: POST /api/v1/jobs
AG->>O: CreateTrainingJob (gRPC)
O->>O: Queue Job Tasks
O->>W: AssignTask (gRPC)
W->>W: Execute Training
W->>O: ReportProgress (gRPC)
O->>AG: Status Update
AG->>F: Real-time Updates (SSE)
F->>U: Progress Display
def monitor_and_scale():
"""Automatic worker scaling based on load"""
while auto_scale_enabled:
utilization = calculate_worker_utilization()
if utilization > scale_up_threshold:
scale_up_workers()
elif utilization < scale_down_threshold:
scale_down_workers()
time.sleep(monitoring_interval)- Health Checks: Regular service health monitoring
- Circuit Breakers: Automatic failure isolation
- Retry Logic: Intelligent request retry with exponential backoff
- Graceful Degradation: Fallback mechanisms for service failures
- Job Metadata: MongoDB Atlas cloud database with high availability
- Model Artifacts: MinIO with versioning and S3-compatible API
- Metrics Data: Application-level metrics collection and storage
- Cache Layer: Redis with persistence for job queues and session data
- Cloud Migration: Transitioned from local MongoDB to MongoDB Atlas for enhanced scalability and reliability
- Multi-Model Support: ResNet, BERT, custom models
- Dynamic Scaling: 1-100+ workers with auto-scaling
- Real-time Monitoring: Live training metrics and logs
- Job Management: Start, stop, pause, resume operations
- Resource Optimization: Intelligent task distribution
- Hyperparameter Tuning: Automated parameter optimization
- Checkpointing: Automatic training state preservation
- Model Versioning: Complete model lifecycle management
- Performance Analytics: Detailed training performance insights
- Cost Optimization: Resource usage optimization algorithms
- Intuitive Dashboard: Drag-and-drop job configuration
- Real-time Updates: Live progress monitoring
- Mobile Responsive: Works on all device sizes
- Dark Mode: User preference support
- Accessibility: WCAG 2.1 compliant interface
- Multi-tenancy: Isolated environments for different teams
- RBAC: Role-based access control
- Audit Logging: Comprehensive activity tracking
- API Keys: Programmatic access management
- SLA Monitoring: Service level agreement tracking
- Horizontal Scaling: Linear performance improvement up to 50 workers
- Throughput: 1000+ concurrent jobs supported
- Response Time: <100ms API response times
- Resource Efficiency: 85% average resource utilization
- Speed Improvement: 3-10x faster training with distributed workers
- Model Accuracy: Maintained accuracy with distributed training
- Convergence: Consistent convergence across different model types
- Fault Recovery: <30 seconds average recovery time
- Uptime: 99.5% system availability
- Error Rate: <0.1% job failure rate
- Recovery Time: <2 minutes average incident resolution
- Data Integrity: Zero data loss incidents
- Resource Savings: 40-60% cost reduction vs traditional setups
- Auto-scaling Efficiency: 30% improvement in resource utilization
- Energy Efficiency: 25% reduction in power consumption
- ROI: 300% return on investment within 6 months
- GPU Support: NVIDIA GPU integration for accelerated training
- Enhanced Kubernetes Support: Production-hardened Kubernetes deployment with Helm charts
- Advanced Authentication: OAuth2/OIDC integration
- Model Registry: Enhanced model versioning and management
- Performance Profiling: Detailed training performance analysis
- Advanced Monitoring: Enhanced metrics visualization and alerting systems
- Multi-cloud Support: AWS, GCP, Azure deployment options
- Federated Learning: Privacy-preserving distributed training
- AutoML Integration: Automated machine learning workflows
- Stream Processing: Real-time data processing capabilities
- Edge Computing: Edge device training coordination
- AI Model Marketplace: Community-driven model sharing
- Quantum ML Support: Quantum machine learning algorithms
- Explainable AI: Model interpretability tools
- Carbon Footprint: Environmental impact optimization
- Enterprise Integration: SAP, Salesforce, Oracle connectors
- Novel Algorithms: Custom distributed training algorithms
- Hardware Optimization: Specialized hardware acceleration
- Network Optimization: Advanced networking protocols
- Security Research: Zero-trust security architecture
- Performance Research: Cutting-edge optimization techniques
- MongoDB Atlas Integration: Successfully migrated from local MongoDB to MongoDB Atlas cloud database
- Benefits: Enhanced reliability, automated backups, better scalability, and reduced operational overhead
- Connection String: Secure connection with authentication credentials managed through environment variables
- Complete Manifest Set: Generated comprehensive Kubernetes manifests using Kompose in
kubenetes/directory - Production Ready: All 12+ services have dedicated deployment and service YAML files
- Resource Management: Configured persistent volume claims for stateful services (MinIO)
- Health Checks: Implemented liveness probes for all critical services
- Streamlined Docker Compose: Simplified local development setup for faster startup
- Focus on Core Services: Docker Compose now focuses on essential services for rapid development
- Lightweight Setup: Removed heavyweight monitoring infrastructure from local development environment
- ConfigMaps: Centralized configuration management with Kubernetes ConfigMaps
- Secrets Management: Secure credential handling with Kubernetes Secrets
- Environment Flexibility: Easy switching between local development and production configurations
TensorFleet successfully addresses the critical challenges in distributed machine learning infrastructure by providing a comprehensive, scalable, and user-friendly platform. The project demonstrates significant technical achievements in microservices architecture, distributed computing, and real-time monitoring.
- Microservices Excellence: 12+ independently deployable services
- Performance Optimization: Sub-second response times at scale
- Fault Tolerance: Robust error handling and recovery mechanisms
- Developer Experience: Intuitive APIs and comprehensive documentation
- Monitoring Excellence: Real-time observability across all components
- Cloud Integration: Successful migration to MongoDB Atlas for enhanced reliability
- Production Deployment: Complete Kubernetes infrastructure with automated deployment scripts
The platform delivers substantial business value through:
- Cost Reduction: Significant infrastructure cost savings
- Time-to-Market: Faster ML model development cycles
- Scalability: Seamless scaling from prototype to production
- Risk Mitigation: Reliable, fault-tolerant training infrastructure
- Innovation Enablement: Democratized access to distributed ML training
The project provided valuable learning experiences in:
- Distributed Systems Design: Complex system architecture patterns
- Microservices Implementation: Service decomposition and communication
- Container Orchestration: Docker and Kubernetes technologies
- Performance Engineering: Optimization and monitoring techniques
- Team Collaboration: Agile development and DevOps practices
TensorFleet represents a significant contribution to the open-source ML infrastructure landscape, providing a reference implementation for distributed machine learning platforms. The project demonstrates industry-ready engineering practices and serves as a foundation for future ML infrastructure innovations.
TensorFleet proves that sophisticated distributed machine learning infrastructure can be both powerful and accessible. By leveraging modern technologies and architectural patterns, the platform successfully bridges the gap between research-grade ML capabilities and production-ready infrastructure, enabling organizations to harness the full potential of distributed machine learning.
The TensorFleet project was developed by a team of three dedicated students, each taking ownership of critical components of the distributed ML platform:
| Team Member | Student ID | Role | Contributions |
|---|---|---|---|
| Aditya Suryawanshi | 25211365 | Backend Infrastructure Lead | API Gateway, Orchestrator, Worker Nodes, gRPC Infrastructure |
| Rahul Mirashi | 25211365 | ML & Data Services Lead | ML Worker, Model Service, Storage Service, MongoDB Integration |
| Soham Maji | 25204731 | Frontend & Monitoring Lead | React Dashboard, Monitoring Service, Documentation, DevOps |
Primary Responsibilities: Core orchestration, job scheduling, and distributed computing infrastructure
Key Contributions:
- Designed and implemented the API Gateway service using Go and Gin framework (~500-600 LOC)
- Built the Orchestrator Service with advanced job scheduling and worker management (~800-1000 LOC)
- Developed the Worker Service for distributed task execution (~400-500 LOC)
- Implemented gRPC service contracts and protocol buffer definitions
- Established Redis integration for job queuing and caching
- Created comprehensive unit and integration tests for Go services
- Documented backend architecture and API specifications
Technical Skills Applied: Go, gRPC, Redis, Microservices Architecture, Distributed Systems
Primary Responsibilities: Machine learning execution, data management, and model registry
Key Contributions:
- Developed the ML Worker Service with support for multiple algorithms (~1200-1500 LOC)
- Implemented scikit-learn and TensorFlow model training pipelines
- Built the Model Service for model registry and versioning (~350-400 LOC)
- Created the Storage Service with MinIO integration (~500-600 LOC)
- Designed MongoDB schema and GridFS integration for model artifacts
- Implemented data preprocessing and model evaluation metrics
- Created sample datasets and training workflows
- Wrote comprehensive tests for ML training logic
Technical Skills Applied: Python, Machine Learning, scikit-learn, TensorFlow, MongoDB, Flask, MinIO
Primary Responsibilities: User interface, visualization, monitoring, and observability
Key Contributions:
- Built the Frontend Dashboard using React 18 and Material-UI (~2000-2500 LOC)
- Implemented real-time job monitoring and worker visualization
- Developed model registry and dataset management interfaces
- Created the Monitoring Service with Prometheus integration (~200-250 LOC)
- Configured Grafana dashboards for system observability
- Managed Docker Compose and Kubernetes deployment configurations
- Wrote comprehensive project documentation and demo materials
- Established CI/CD workflows and deployment automation
Technical Skills Applied: React, Material-UI, JavaScript, Python, Prometheus, Grafana, Docker, Kubernetes, Technical Writing
The team worked collaboratively on several cross-cutting concerns:
- Integration Testing: All members contributed to end-to-end testing and validation
- Code Reviews: Regular peer reviews ensured code quality and knowledge sharing
- Architecture Decisions: Major design decisions were made collectively through team discussions
- Documentation: Each member documented their components and contributed to overall project documentation
- Deployment: Coordinated deployment strategies and troubleshooting across all services
The team followed agile development practices:
- Sprint Planning: Weekly sprints with clear goals and deliverables
- Daily Standups: Regular communication through team channels
- Version Control: Git workflow with feature branches and pull requests
- Issue Tracking: GitHub Issues for task management and bug tracking
- Code Standards: Established coding conventions and review processes
The workload was distributed equitably across the team:
| Metric | Aditya | Rahul | Soham |
|---|---|---|---|
| Estimated Hours | 120-150 | 120-150 | 120-150 |
| Services Built | 3 | 3 | 2 + docs |
| Lines of Code | ~1500-2000 | ~2000-2500 | ~2200-2800 |
| Test Coverage | 90%+ | 85%+ | 80%+ |
| Documentation | High | High | Very High |
Each team member gained valuable experience in:
- Technical Skills: Hands-on experience with modern technologies and frameworks
- System Design: Understanding of distributed systems and microservices architecture
- Collaboration: Working effectively in a team environment
- DevOps: Container orchestration, deployment automation, and monitoring
- Problem Solving: Debugging complex distributed system issues
- Communication: Technical documentation and presentation skills
The team would like to acknowledge:
- Course instructors for guidance and feedback
- Open-source communities for excellent documentation and tools
- MongoDB Atlas for cloud database services
- Docker and Kubernetes communities for container orchestration support
Detailed Work Division: For a comprehensive breakdown of responsibilities, tasks, and timelines, see TEAM_WORK_DIVISION.md
Context:
TensorFleet needed to support scalable, maintainable, and independently deployable components for distributed ML training. The team had to decide between a monolithic architecture, coarse-grained services, or fine-grained microservices.
Decision:
We adopted a microservices architecture, decomposing the system into 12+ independent services (API Gateway, Orchestrator, Worker, Monitoring, Storage, etc.), each with a single responsibility and clear boundaries.
Consequences (Trade-offs):
- Positive:
- Independent scaling and deployment of services.
- Improved fault isolation—failures in one service do not cascade.
- Technology heterogeneity (Go for orchestration, Python for ML).
- Easier team parallelism and codebase ownership.
- Negative:
- Increased operational complexity (service discovery, configuration, monitoring).
- Higher resource overhead (multiple containers, inter-service communication).
- More complex debugging and distributed tracing.
Context:
The platform required reliable, efficient communication between services (e.g., job submission, worker coordination, monitoring). Options included REST, gRPC, and message brokers (Kafka/RabbitMQ).
Decision:
We chose REST APIs for user-facing operations (API Gateway ↔ Frontend) and gRPC for internal service-to-service communication (API Gateway ↔ Orchestrator ↔ Workers). Redis Pub/Sub is used for real-time updates and lightweight eventing.
Consequences (Trade-offs):
- Positive:
- gRPC provides high performance, strong typing, and efficient binary serialization for internal comms.
- REST is widely supported and easy to consume for external clients.
- Redis Pub/Sub enables low-latency event notifications.
- Negative:
- gRPC requires additional tooling (protobufs, codegen) and is less human-readable than REST.
- Multiple protocols increase maintenance and learning curve.
- Redis Pub/Sub is not persistent—missed events if consumers are offline.
Context:
TensorFleet needed a deployment approach that supports local development, easy scaling, and production readiness. Options included bare-metal deployment, Docker Compose, and Kubernetes.
Decision:
We containerized all services using Docker and used Docker Compose for local development. For production, the system is designed to be Kubernetes-ready, with manifests provided for scalable, resilient deployment.
Consequences (Trade-offs):
- Positive:
- Consistent environments across dev, test, and prod.
- Easy local setup with Docker Compose.
- Kubernetes enables auto-scaling, self-healing, and advanced orchestration.
- Negative:
- Kubernetes has a steep learning curve and operational overhead.
- More complex CI/CD and configuration management.
- Resource usage is higher compared to bare-metal or single-container setups.
Context:
Initially, TensorFleet used a local MongoDB instance deployed alongside other services. As the project matured, concerns arose about database reliability, backup management, scalability, and operational overhead for production deployments.
Decision:
We migrated from a self-hosted MongoDB container to MongoDB Atlas, a fully-managed cloud database service. All services were updated to connect to MongoDB Atlas using secure connection strings with authentication credentials.
Consequences (Trade-offs):
- Positive:
- Automated backups and point-in-time recovery.
- Built-in high availability with replica sets and automatic failover.
- Reduced operational complexity—no need to manage MongoDB upgrades, patches, or scaling.
- Better performance monitoring and optimization tools provided by Atlas.
- Simplified local development—no need to run MongoDB container locally.
- Geographic distribution options for reduced latency.
- Negative:
- Ongoing cloud service costs (though offset by reduced operational overhead).
- External dependency on MongoDB Atlas service availability.
- Network latency for database operations (mitigated by choosing appropriate regions).
- Requires internet connectivity for database access during development.
- Credentials management becomes more critical for security.
Document Information
- Version: 1.2
- Date: December 21, 2025
- Authors:
- Aditya Suryawanshi (25211365) - Backend Infrastructure Lead
- Rahul Mirashi (25211365) - ML & Data Services Lead
- Soham Maji (25204731) - Frontend & Monitoring Lead
- Review Status: Final
- Distribution: Public
This report represents the comprehensive documentation of the TensorFleet project, showcasing the technical depth, architectural sophistication, and practical value delivered by the platform.