
doc: distributed training #16

@wey-gu

Description

How to do distributed training:

Load the data and partition the graph

```python
import dgl

g = ...  # load the DGLGraph object with nebula-dgl
dgl.distributed.partition_graph(g, 'mygraph', 2, 'data_root_dir')
```
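
For the load step, a sketch of how `g` is typically built with nebula-dgl's `NebulaLoader`; the connection config values and the mapper YAML file name here are illustrative assumptions, not fixed values:

```python
import yaml

from nebula_dgl import NebulaLoader

# Connection details for the NebulaGraph cluster (illustrative values).
nebula_config = {
    "graph_hosts": [("graphd", 9669)],
    "nebula_user": "root",
    "nebula_password": "nebula",
}

# The mapper YAML describes how NebulaGraph tags/edges map to DGL
# node/edge types and features (hypothetical file name).
with open("nebula_to_dgl_mapper.yaml", "r") as f:
    feature_mapper = yaml.safe_load(f)

nebula_loader = NebulaLoader(nebula_config, feature_mapper)
g = nebula_loader.load()  # a DGLGraph built from the NebulaGraph data
```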

Running `partition_graph` writes the partitioned graph out as:

```
data_root_dir/
  |-- mygraph.json          # metadata JSON. File name is the given graph name.
  |-- part0/                # data for partition 0
  |  |-- node_feats.dgl     # node features stored in binary format
  |  |-- edge_feats.dgl     # edge features stored in binary format
  |  |-- graph.dgl          # graph structure of this partition stored in binary format
  |
  |-- part1/                # data for partition 1
     |-- node_feats.dgl
     |-- edge_feats.dgl
     |-- graph.dgl
```
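
To sanity-check the output, each partition can be loaded back from the metadata JSON. A minimal sketch; note that the exact shape of `load_partition`'s return tuple varies a bit across DGL versions:

```python
import dgl

# Load partition 0 back via the metadata JSON written above.
# The first element of the returned tuple is the partition's graph
# structure; the remaining elements carry node/edge features and the
# partition book that maps global IDs to partitions.
part_data = dgl.distributed.load_partition('data_root_dir/mygraph.json', 0)
part_graph = part_data[0]
print(part_graph)  # local nodes/edges (including HALO nodes) of partition 0
```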

See more in the reference docs:

ref:

Prepare the distributed training environment

  • Create a cluster of machines
  • Upload the training script and the partitioned data to every machine in the cluster
    • Consider NFS/JuiceFS so the data is easy to access from all of the distributed servers
  • Set up SSH access: distribute an SSH public key to each machine to enable password-less SSH auth
  • Launch the training job (see the sketch after this list)
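
To make the last step concrete, here is a minimal sketch of what a training entry point run on each machine typically sets up. The file names `train_dist.py` and `ip_config.txt`, and the graph/metadata names carried over from the partitioning example above, are assumptions for illustration:

```python
# train_dist.py -- minimal sketch of a distributed training entry point.
# 'ip_config.txt' is assumed to be a plain-text file listing one
# machine IP per line.
import dgl
import torch as th

# Connect to the DGL servers that serve the graph partitions.
dgl.distributed.initialize('ip_config.txt')

# Trainers synchronize gradients through a PyTorch process group.
th.distributed.init_process_group(backend='gloo')

# DistGraph presents all partitions as one logical graph, backed by
# the metadata JSON produced by partition_graph above.
g = dgl.distributed.DistGraph('mygraph',
                              part_config='data_root_dir/mygraph.json')

# Split the training node IDs so every trainer owns a disjoint share.
train_nids = dgl.distributed.node_split(
    g.ndata['train_mask'], g.get_partition_book())
```

DGL also ships a launch helper (`tools/launch.py` in the DGL repository) that SSHes into every machine in the IP config, starts the server processes, and then starts the trainers with the given command (e.g. `python3 train_dist.py`), which is why the password-less SSH setup above matters.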

ref:
