This project is a simple implementation of a Distributed File System (DFS) in Go. It uses the Raft consensus algorithm to replicate file metadata (creations, deletions, etc.) across a cluster of nodes, ensuring consistency and fault tolerance.
Unlike basic metadata-only systems, this implementation also replicates file content from the leader to all follower nodes, improving availability and durability.
The Raft layer is based on a modified version of Phil Eaton's goraft.
- **Raft-based Metadata Replication:** File operations (create, delete) are committed to a distributed log via Raft, ensuring metadata is strongly consistent across the cluster.
- **Leader Election:** Nodes elect a leader automatically. All write operations go through the leader, while reads can be served from any replica.
- **File Content Replication:** When a file is uploaded, the leader replicates its content to all followers. This ensures that file data is not lost if the leader crashes.
- **Basic Fault Tolerance:** The system tolerates up to `(N-1)/2` failures in a cluster of size `N`. Both metadata and file content remain available as long as a quorum survives.
- **Automatic Recovery Manager:** Detects and recovers missing files by fetching them from healthy nodes (runs every 30 seconds).
- **Health Monitoring:** Continuously monitors the health of all cluster nodes with configurable failure thresholds (checks every 5 seconds).
- **Data Repair & Integrity:** Uses SHA-256 checksums to detect corrupted files and automatically repairs them from healthy replicas (verifies every 60 seconds).
- **Performance Metrics:** Tracks operation latency, success rates, and replication overhead, and provides trade-off analysis between reliability and performance.
- **NTP-Based Time Synchronization:** The system was upgraded to use the Network Time Protocol (NTP) for time synchronization.
  - Previous method (removed): Follower nodes would ask the Raft leader for the current time. This approach was simple but had a major drawback: if the leader's clock was inaccurate, the entire cluster's time would be skewed.
  - Current method: Each node now independently contacts an external NTP server (`pool.ntp.org`) to synchronize its clock. This removes the dependency on the leader, making timestamps significantly more accurate and robust across the entire cluster.
Interact with the DFS using REST-like endpoints for upload, download, delete, list, and status checks.
```mermaid
graph TD
    A[Client] --> B(Leader Node);
    B --> C(Follower Node 1);
    B --> D(Follower Node 2);
    subgraph Cluster
        B((Leader Node<br/>Raft + Files))
        C((Follower Node 1<br/>Raft + Files))
        D((Follower Node 2<br/>Raft + Files))
    end
    style B fill:#f9f,stroke:#333,stroke-width:2px;
    style C fill:#ccf,stroke:#333;
    style D fill:#ccf,stroke:#333;
    B -- "Metadata: Raft Log" --> C;
    B -- "File Content: Replication" --> C;
    B -- "Metadata: Raft Log" --> D;
    B -- "File Content: Replication" --> D;
```
- **Metadata:** Replicated via the Raft log
- **File Content:** Replicated from the leader to followers
- **Reads:** Can be served from any node
- Go: Version 1.20+
- PowerShell: To run the cluster startup script on Windows
- curl: For quick testing of the HTTP API
```
go build -o dfsapi.exe .
.\start-cluster.ps1
```

This launches 3 nodes:

- Node 1: `http://localhost:8081` (Raft RPC on `:3030`)
- Node 2: `http://localhost:8082` (Raft RPC on `:3031`)
- Node 3: `http://localhost:8083` (Raft RPC on `:3032`)
```
curl http://localhost:8081/status
```

Look for `"is_leader": true`.
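A client can also locate the leader programmatically by polling each node's `/status` endpoint and checking the `is_leader` field. This is a hedged sketch: the `is_leader` field name comes from the output above, but the helper names and the assumption that the rest of the JSON can be ignored are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// nodeStatus models only the field we need from GET /status; the JSON
// decoder silently ignores any other fields in the real response.
type nodeStatus struct {
	IsLeader bool `json:"is_leader"`
}

// isLeader reports whether a raw /status body belongs to the leader.
func isLeader(body []byte) (bool, error) {
	var s nodeStatus
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return s.IsLeader, nil
}

// findLeader polls each node URL and returns the first one that
// reports is_leader: true.
func findLeader(nodes []string) (string, error) {
	for _, n := range nodes {
		resp, err := http.Get(n + "/status")
		if err != nil {
			continue // node may be down; try the next one
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if ok, _ := isLeader(body); ok {
			return n, nil
		}
	}
	return "", fmt.Errorf("no leader found")
}

func main() {
	nodes := []string{"http://localhost:8081", "http://localhost:8082", "http://localhost:8083"}
	if leader, err := findLeader(nodes); err == nil {
		fmt.Println("leader:", leader)
	}
}
```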
```
echo "hello distributed world" > my-test-file.txt
curl -X POST --data-binary @my-test-file.txt http://localhost:8081/upload/my-first-file.txt
```

List files from any node:

```
curl http://localhost:8082/files
```

Download the file from another node:

```
curl http://localhost:8083/upload/my-first-file.txt
```

Delete the file:

```
curl -X DELETE http://localhost:8081/upload/my-first-file.txt
```

Check recovery status:

```
curl http://localhost:8081/recovery/status
```

Shows total files, local files, and missing files that need recovery.
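The same upload/download workflow can be driven from Go's standard `net/http` client. This is a sketch against the endpoints shown above; the file name and the minimal error handling are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// uploadURL builds the /upload/<name> endpoint for a file on a node.
func uploadURL(node, name string) string {
	return node + "/upload/" + name
}

func main() {
	leader := "http://localhost:8081"
	content := []byte("hello distributed world\n")

	// POST the raw bytes, mirroring `curl --data-binary`.
	resp, err := http.Post(uploadURL(leader, "my-first-file.txt"),
		"application/octet-stream", bytes.NewReader(content))
	if err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	resp.Body.Close()

	// Download from a follower; reads are served by any node.
	resp, err = http.Get(uploadURL("http://localhost:8083", "my-first-file.txt"))
	if err != nil {
		fmt.Println("download failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s", body)
}
```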
```
curl http://localhost:8081/health/status
```

Displays health status of all cluster nodes with response times and failure counts.

```
curl http://localhost:8081/repair/status
```

Shows verified files, repairs performed, and corruptions detected.

```
curl http://localhost:8081/metrics
```

Displays operation counts, average durations, success rates, and replication statistics.

```
curl http://localhost:8081/metrics/tradeoffs
```

Provides analysis of reliability vs. performance trade-offs in the current configuration.
Manually verify a specific file:

```
curl -X POST -H "Content-Type: application/json" \
  -d '{"file_path":"my-first-file.txt"}' \
  http://localhost:8081/repair/verify
```

Trigger a manual recovery sync:

```
curl -X POST http://localhost:8081/recovery/sync
```

- `POST /upload/{filename}` - Upload a file (leader only)
- `GET /upload/{filename}` - Download a file (any node)
- `DELETE /upload/{filename}` - Delete a file (leader only)
- `GET /files` - List all files (any node)
- `GET /status` - Node status and leader information
- `GET /recovery/status` - Recovery manager status
- `POST /recovery/sync` - Trigger manual recovery sync
- `GET /health/status` - Cluster health monitoring
- `GET /repair/status` - Data repair and integrity status
- `POST /repair/verify` - Manually verify a specific file
- `GET /metrics` - Performance metrics
- `GET /metrics/tradeoffs` - Trade-off analysis

Internal endpoints (used between nodes):

- `POST /replicate/{filename}` - Receive replicated file content
- `POST /delete-content/{filename}` - Delete local file content
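On the follower side, the internal `/replicate/` endpoint implies a handler shaped roughly like the sketch below. The handler and helper names, the `data/` directory, and the demo port are all illustrative assumptions, not the project's actual code:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"strings"
)

// replicatedName extracts the file name from a /replicate/<name> path
// and rejects anything that could escape the data directory.
func replicatedName(urlPath string) (string, bool) {
	name := strings.TrimPrefix(urlPath, "/replicate/")
	if name == "" || name != filepath.Base(name) {
		return "", false
	}
	return name, true
}

// handleReplicate stores file content pushed by the leader to disk.
func handleReplicate(w http.ResponseWriter, r *http.Request) {
	name, ok := replicatedName(r.URL.Path)
	if !ok {
		http.Error(w, "bad file name", http.StatusBadRequest)
		return
	}
	f, err := os.Create(filepath.Join("data", name))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer f.Close()
	if _, err := io.Copy(f, r.Body); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	// Exercise the handler against an in-process test server; the real
	// node would instead register it on its HTTP mux.
	os.MkdirAll("data", 0o755)
	srv := httptest.NewServer(http.HandlerFunc(handleReplicate))
	defer srv.Close()
	resp, err := http.Post(srv.URL+"/replicate/my-first-file.txt",
		"application/octet-stream", strings.NewReader("hello"))
	if err == nil {
		fmt.Println(resp.Status)
		resp.Body.Close()
	}
}
```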
- Runs every 30 seconds
- Compares local files with cluster metadata
- Automatically fetches missing files from healthy nodes
- Ensures eventual consistency of file content
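The core of each recovery cycle is a set difference between the cluster's metadata and the local directory. A sketch with illustrative names (the real manager would run this inside a 30-second ticker loop and then fetch each missing file from a healthy node):

```go
package main

import "fmt"

// missingFiles returns the names present in the cluster metadata but
// absent locally; these are the files the recovery manager must fetch.
func missingFiles(cluster []string, local map[string]bool) []string {
	var missing []string
	for _, name := range cluster {
		if !local[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// One iteration of the cycle; the real loop repeats every 30s.
	local := map[string]bool{"a.txt": true}
	cluster := []string{"a.txt", "b.txt"}
	fmt.Println(missingFiles(cluster, local)) // [b.txt]
}
```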
- Performs health checks every 5 seconds
- Marks nodes unhealthy after 3 consecutive failures
- Provides real-time cluster health visibility
- Helps identify network partitions or node failures
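The "unhealthy after 3 consecutive failures" rule reduces to a small counter that any success resets. A sketch (type and method names are illustrative, not the project's):

```go
package main

import "fmt"

// healthTracker marks a node unhealthy after `threshold` consecutive
// failed checks, mirroring the 3-failure rule described above.
type healthTracker struct {
	failures  int
	threshold int
}

// observe records one health-check result and reports whether the
// node is still considered healthy.
func (h *healthTracker) observe(ok bool) bool {
	if ok {
		h.failures = 0 // any success resets the streak
	} else {
		h.failures++
	}
	return h.failures < h.threshold
}

func main() {
	h := &healthTracker{threshold: 3}
	// A transient blip (two failures, then a success) never marks the
	// node unhealthy; only a sustained outage does.
	for _, ok := range []bool{false, false, true, false, false, false} {
		fmt.Println(h.observe(ok))
	}
}
```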
- Calculates SHA-256 checksums for all files
- Verifies file integrity every 60 seconds
- Detects silent data corruption
- Automatically repairs corrupted files from healthy replicas
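The checksum comparison at the heart of the repair loop looks roughly like this. Only the use of SHA-256 comes from the feature list above; the `checksum` helper and the idea of comparing against a stored digest are an illustrative sketch:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// checksum returns the hex SHA-256 digest the repair manager would
// compare against the value recorded in cluster metadata.
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	stored := checksum([]byte("hello distributed world\n"))
	onDisk := []byte("hello distrjbuted world\n") // a single flipped byte
	if checksum(onDisk) != stored {
		fmt.Println("corruption detected; refetch from a healthy replica")
	}
}
```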
- Tracks all file operations (create, read, delete)
- Measures latency, throughput, and success rates
- Analyzes replication overhead
- Provides insights on reliability vs. performance trade-offs
- Logs summary every 5 minutes
Building this distributed file system involved overcoming several common but complex engineering challenges:
- **Concurrency and State Management:**
  - Challenge: Ensuring that shared data (like the file list or performance metrics) could be safely accessed by concurrent HTTP requests, background Raft processes, and timed fault-tolerance checks without causing race conditions.
  - Example: The initial implementation for tracking the success/failure of a file upload required a complex wrapper around Go's standard `http.ResponseWriter` just to capture the status code from a concurrent handler. This highlights the difficulty of managing state across asynchronous operations.
- **Correctly Implementing Distributed Concepts:**
  - Challenge: The theoretical concepts of distributed systems have subtle but critical details. The initial implementation of time synchronization is a prime example.
  - Example: The system first used a simple, leader-based time sync where followers would ask the leader for the time. The difficulty was realizing this approach was flawed: if the leader's clock was wrong, the entire cluster's time would be skewed. This led to its replacement with a more robust NTP-based solution, where each node consults an external, authoritative time source.
- **Robust Configuration and Startup:**
  - Challenge: Parsing command-line arguments manually is error-prone. The initial `getConfig` function had to carefully iterate through `os.Args` to find flags and their values.
  - Example: This manual approach is brittle; it can easily break if arguments are provided in an unexpected order. While functional, it represents the common difficulty of building robust configuration handling without relying on standard libraries like Go's `flag` package.
```
.
├── consensus/            # Raft consensus implementation
│   ├── raft.go           # Core Raft logic
│   ├── raft_test.go      # Persistence tests
│   └── util.go           # Utility functions
├── replication/          # File replication logic
│   └── replication.go    # State machine and handlers
├── faulttolerance/       # Fault tolerance features
│   ├── recovery.go       # Recovery manager
│   ├── health.go         # Health monitoring
│   ├── datarepair.go     # Data integrity verification
│   ├── metrics.go        # Performance metrics
│   └── faulttolerance.go # Delete operations
├── timesync/             # Time synchronization
│   └── timesync.go       # Clock synchronization
├── main.go               # Entry point and HTTP server
├── start-cluster.ps1     # Cluster startup script
└── README.md             # This file
```