
doc: distributed training #16

@wey-gu

Description

How to do distributed training:

Load the data and partition the graph

```python
import dgl

g = ...  # load the DGLGraph object with nebula-dgl
dgl.distributed.partition_graph(g, 'mygraph', 2, 'data_root_dir')
```
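
For the load step, a sketch of how `g` is typically built with nebula-dgl's `NebulaLoader`; the connection config values and the mapper YAML file name here are illustrative assumptions, not fixed values:

```python
import yaml

from nebula_dgl import NebulaLoader

# Connection details for the NebulaGraph cluster (illustrative values).
nebula_config = {
    "graph_hosts": [("graphd", 9669)],
    "nebula_user": "root",
    "nebula_password": "nebula",
}

# The mapper YAML describes how NebulaGraph tags/edges map to DGL
# node/edge types and features (hypothetical file name).
with open("nebula_to_dgl_mapper.yaml", "r") as f:
    feature_mapper = yaml.safe_load(f)

nebula_loader = NebulaLoader(nebula_config, feature_mapper)
g = nebula_loader.load()  # a DGLGraph built from the NebulaGraph data
```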

Running `partition_graph` writes the partitioned graph out as:

```
data_root_dir/
  |-- mygraph.json          # metadata JSON. File name is the given graph name.
  |-- part0/                # data for partition 0
  |  |-- node_feats.dgl     # node features stored in binary format
  |  |-- edge_feats.dgl     # edge features stored in binary format
  |  |-- graph.dgl          # graph structure of this partition stored in binary format
  |
  |-- part1/                # data for partition 1
     |-- node_feats.dgl
     |-- edge_feats.dgl
     |-- graph.dgl
```
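
To sanity-check the output, each partition can be loaded back from the metadata JSON. A minimal sketch; note that the exact shape of `load_partition`'s return tuple varies a bit across DGL versions:

```python
import dgl

# Load partition 0 back via the metadata JSON written above.
# The first element of the returned tuple is the partition's graph
# structure; the remaining elements carry node/edge features and the
# partition book that maps global IDs to partitions.
part_data = dgl.distributed.load_partition('data_root_dir/mygraph.json', 0)
part_graph = part_data[0]
print(part_graph)  # local nodes/edges (including HALO nodes) of partition 0
```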

See more in the reference docs:

ref:

Prepare the distributed training environment

  • Create a cluster of machines
  • Upload the training script and the partitioned data to every machine in the cluster
    • Consider NFS/JuiceFS so the data is easy to access from all of the distributed servers
  • Set up SSH access: distribute an SSH public key to each machine to enable password-less SSH auth
  • Launch the training job (see the sketch after this list)
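
To make the last step concrete, here is a minimal sketch of what a training entry point run on each machine typically sets up. The file names `train_dist.py` and `ip_config.txt`, and the graph/metadata names carried over from the partitioning example above, are assumptions for illustration:

```python
# train_dist.py -- minimal sketch of a distributed training entry point.
# 'ip_config.txt' is assumed to be a plain-text file listing one
# machine IP per line.
import dgl
import torch as th

# Connect to the DGL servers that serve the graph partitions.
dgl.distributed.initialize('ip_config.txt')

# Trainers synchronize gradients through a PyTorch process group.
th.distributed.init_process_group(backend='gloo')

# DistGraph presents all partitions as one logical graph, backed by
# the metadata JSON produced by partition_graph above.
g = dgl.distributed.DistGraph('mygraph',
                              part_config='data_root_dir/mygraph.json')

# Split the training node IDs so every trainer owns a disjoint share.
train_nids = dgl.distributed.node_split(
    g.ndata['train_mask'], g.get_partition_book())
```

DGL also ships a launch helper (`tools/launch.py` in the DGL repository) that SSHes into every machine in the IP config, starts the server processes, and then starts the trainers with the given command (e.g. `python3 train_dist.py`), which is why the password-less SSH setup above matters.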

ref:
