Overview
Adding new hosts to a TLS-enabled cluster fails because each new node must first receive a 14 GB file, distributed by the BitTorrent client running on these hosts, and that torrent transfer gets stuck indefinitely.
Architecture
Cluster Architecture
Our cluster has a master node and worker nodes that report the cluster's state to the master. The master generates the .torrent file, which is trackerless. The master effectively acts as a tracker: it gives each peer the information about the other peers it should communicate with during torrenting.
Torrent Architecture
Torrent Process During Fresh Cluster Install
This is the process followed during a fresh cluster setup:
- The master node first downloads (HTTP fetch) the parcel from a remote server, then acts as a web seed.
- Each peer obtains the other peers' information (host IP, port) from the master node's heartbeat response and then calls the torrent client's AddPeers() API. This AddPeers() call is repeated after every heartbeat response from the master to the worker peer.
- Torrenting starts between the peers (master + workers).
- If the parcel download does not complete via BitTorrent within a certain time, fallback logic performs an HTTP download from the web seed instead.
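For readers unfamiliar with the setup, the fallback step can be sketched roughly as below. This is only an illustration of the decision, not our actual implementation: `DownloadState`, `choose_transport`, and `deadline_s` are hypothetical names, and the real timeout value is not stated here.

```python
from dataclasses import dataclass


@dataclass
class DownloadState:
    pieces_done: int    # pieces verified so far
    pieces_total: int   # total pieces in the torrent
    elapsed_s: float    # seconds since the torrent download started


def choose_transport(state: DownloadState, deadline_s: float) -> str:
    """Return 'done', 'bittorrent', or 'http-fallback' (hypothetical labels)."""
    if state.pieces_done >= state.pieces_total:
        return "done"
    if state.elapsed_s < deadline_s:
        # Still inside the time budget: keep torrenting between peers.
        return "bittorrent"
    # Budget exhausted: fetch the parcel over plain HTTP from the web seed.
    return "http-fallback"
```

With a hypothetical 600 second budget, a peer stuck at 0 of 1669 pieces after 900 seconds would be switched to the HTTP download.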
Torrent Process During New Host Addition in Existing Cluster
This is the general flow of how new hosts are added in an existing cluster:
- The new hosts being added to the cluster install both the Anacrolix/torrent binary and the libtorrent binary. By default, the Anacrolix/torrent client process runs.
- These new hosts contact the web seed (the master node) in the existing cluster, whether or not TLS is enabled, to download the 14 GB file.
- Simultaneously, these new peers start distributing pieces of this 14 GB file among each other.
Scenarios with New Host(s) Addition
Without TLS Enabled on the Existing Cluster
- Whether the new hosts run Anacrolix/torrent or libtorrent, the 14 GB file is distributed quickly and the hosts are added to the cluster.
- With Anacrolix/torrent, the peers' status output lists the web seed while the 14 GB parcel is being torrented:
webseeds:
- CLOSED: http://ccycloud-1.b-135-no-tls.root.comops.site:7180/cmf/parcel/download/CDH-7.2.18-1.cdh7.2.18.p0.51297892-el8.parcel
last unhandled error: never
bep40-prio: e97fd7f2
last msg: never, connected: never, last helpful: 147.05s ago, itime: 2m41.004987105s, etime: 13.875250838s
1669/1669 completed, 0 pieces touched, good chunks: 40889/40889:0 reqq: 0+0/(84/128):0/1024, flags: i:WS:, dr: 47132.0 KiB/s
requested pieces:
With TLS Enabled in the Existing Cluster
Case #1: Libtorrent Client Process Runs on the New Hosts
The 14 GB file gets distributed within a few minutes.
Case #2: Anacrolix/Torrent Process Runs on the New Hosts
Distribution of the 14 GB file gets stuck on the new nodes because none of the new peers can contact the web seed (the master node) in the existing cluster. The web seed section of the full status output is empty:
webseeds: <--- no web seed
2 peer conns:
- 10.140.93.137:51680-10.140.40.8:7191
peer id: "-GT0003-\xb3.\x9epQ\xd6LG\x03\xad\xce8"
extensions: 0000000000100005 (ltep, fast, dht)
ltep extensions: map[ut_holepunch:2 ut_metadata:1 ut_pex:3]
pex: 2 conns, 0 unsent events
bep40-prio: e8a31f71
last msg: 26.36s ago, connected: 86.37s ago, last helpful: never, itime: 0s, etime: 0s
0/1669 completed, 0 pieces touched, good chunks: 0/0:0 reqq: 0+0/(1/1024):0/1024, flags: :M,e,v1:, dr: 0.0 KiB/s
requested pieces:
- 10.140.93.137:7191-10.140.24.8:43468
peer id: "-GT0003-\xfc\x93{w:\x94~\x8f\x13\u0671\x1b"
extensions: 0000000000100005 (ltep, fast, dht)
ltep extensions: map[ut_holepunch:2 ut_metadata:1 ut_pex:3]
pex: 2 conns, 0 unsent events
bep40-prio: d766eef0
last msg: 86.29s ago, connected: 86.29s ago, last helpful: never, itime: 0s, etime: 0s
0/1669 completed, 0 pieces touched, good chunks: 0/0:0 reqq: 0+0/(1/1024):0/1024, flags: :I,e,v1:, dr: 0.0 KiB/s
requested pieces:
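To compare the two cases across many hosts, a small script like the one below can pull the web seed URLs out of a saved status dump. It is a rough sketch that assumes the layout seen in the excerpts in this issue (a `webseeds:` line, followed later by an `N peer conns:` line); the function name `webseed_urls` is made up for illustration and the status format is not a stable interface.

```python
import re

URL_RE = re.compile(r"https?://\S+")
PEER_CONNS_RE = re.compile(r"^\d+ peer conns:")


def webseed_urls(status_text: str) -> list[str]:
    """Collect URLs listed between 'webseeds:' and the 'N peer conns:' line."""
    urls: list[str] = []
    in_section = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("webseeds:"):
            in_section = True
            continue
        if in_section:
            if PEER_CONNS_RE.match(line):
                break  # the web seed section ends where the peer list starts
            urls.extend(URL_RE.findall(line))
    return urls
```

An empty result for a host corresponds to the stuck case shown above, where no web seed is listed at all.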
Hi @anacrolix, can you please provide pointers on why the web seed endpoint, http://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>, is not reachable from the new peers?
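One thing that can be checked independently of the torrent client is whether the endpoint answers at all from a new host, over `http://` versus `https://`. The probe below is a minimal sketch using only the Python standard library. It assumes the TLS port may expect HTTPS and that the master's certificate may need a custom CA bundle; both are assumptions, not confirmed causes. `web_seed_url`, `probe`, and the `cafile` path are hypothetical names for illustration.

```python
import http.client
import ssl
import urllib.error
import urllib.request
from typing import Optional


def web_seed_url(host: str, port: int, parcel: str, tls: bool) -> str:
    """Build the parcel download URL with a scheme matching the TLS setting."""
    scheme = "https" if tls else "http"
    return f"{scheme}://{host}:{port}/cmf/parcel/download/{parcel}"


def probe(url: str, cafile: Optional[str] = None, timeout: float = 10.0) -> str:
    """Request only the first byte of the parcel and describe the outcome."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    # Only build a TLS context for https URLs; cafile=None uses system CAs.
    ctx = ssl.create_default_context(cafile=cafile) if url.startswith("https://") else None
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return f"reachable: HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        return f"reachable, but server answered HTTP {err.code}"
    except (urllib.error.URLError, http.client.HTTPException, OSError) as err:
        return f"unreachable: {err}"
```

Running `probe(web_seed_url(host, port, parcel, tls=False))` and then the same call with `tls=True` from one of the new hosts would show whether the scheme or certificate trust makes a difference before looking further into the client.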