m4: A Learned Flow-level Network Simulator

This repository provides scripts and instructions to replicate the experiments from our paper, m4: A Learned Flow-level Network Simulator. It includes all necessary tools to reproduce the experimental results documented in Sections 5.2 to 5.6 of the paper.

Repository Structure

├── checkpoints/                    # Pre-trained model checkpoints
├── config/                         # Configuration files for training and testing m4
├── figs/                          # Generated figures and plots from experiments
├── High-Precision-Congestion-Control/ # HPCC repository for data generation
├── inference/                     # C++ inference engine for m4
├── parsimon-eval/                 # Scripts to reproduce m4 experiments and comparisons
├── results/                       # Experimental results and outputs
├── results_train/                 # Training results and outputs
├── testbed/                       # Testbed integration with ns-3, FlowSim, and m4 backends
│   ├── backends/                  # Backend implementations
│   │   ├── m4/                    # M4 ML-based simulator
│   │   ├── flowsim/               # FlowSim flow-level simulator
│   │   └── UNISON/                # NS3 packet-level simulator (UNISON)
│   ├── eval_test/                 # Test scenarios and results
│   │   ├── testbed/               # Real hardware ground truth (24 scenarios)
│   │   ├── m4/                    # M4 simulation results
│   │   ├── flowsim/               # FlowSim simulation results
│   │   └── ns3/                   # NS3 simulation results
│   ├── eval_train/                # Training data generation
│   ├── results/                   # Generated plots and accuracy summaries
│   ├── results_train/             # Training results and outputs
│   ├── run.py                     # Main runner script for simulations
│   ├── analyze.py                 # Results analysis and visualization
│   └── build.sh                   # Build script for all backends
├── SimAI/                         # SimAI integration with UNISON, flowSim, and m4 backends
│   ├── astra-sim-alibabacloud/    # Core simulation framework
│   │   ├── astra-sim/             # AstraSim system layer
│   │   │   ├── network_frontend/  # Network backend implementations
│   │   │   │   ├── ns3/           # UNISON (ns-3) packet-level simulator
│   │   │   │   ├── flowsim/       # flowSim analytical simulator
│   │   │   │   └── m4/            # m4 ML-based simulator
│   │   │   └── system/            # System components (routing, collective ops)
│   │   ├── extern/                # ns-3 source code
│   │   └── build.sh               # Build script for all backends
│   ├── example/                   # Example workloads and topologies
│   │   ├── gray_failures/         # 105 pre-generated gray failure topology files
│   │   │   └── gray_topo_N{2-16}_R{4-10}.txt  # Topology files for N degraded GPUs, R reduction factor
│   │   ├── microAllReduce.txt     # AllReduce collective workload
│   │   └── SimAI.conf             # ns-3 configuration
│   ├── scripts/                   # Build and run scripts
│   ├── results_gray_failures/     # Pre-computed gray failure results (315 simulations)
│   │   └── n_{N}_r_{R}_{backend}/ # Individual scenario results (ns3/flowsim/m4)
│   ├── gray_failure_run_sweep.py  # Gray failure sweep runner
│   ├── gray_failure_plot_results.py # Generate evaluation plots (6 figures)
│   └── gray_failure_topo_viz.py   # Topology visualization tool
├── util/                          # Utility functions for m4, including data loaders and ML model implementations
├── main_train.py                  # Main script for training and testing m4
└── plot_results.ipynb            # Jupyter notebook for visualizing results

Quick Reproduction

To quickly reproduce the results in the paper, follow these steps:

1. Clone the repository and initialize submodules:

git clone https://github.com/netiken/m4.git
cd m4
git submodule update --init --recursive

2. Set up Python environment:

Install uv, a fast Python package manager, by following the installation guide at https://docs.astral.sh/uv/getting-started/installation/

Then create and activate the virtual environment:

uv sync
source .venv/bin/activate  # Activate the virtual environment!

3. Reproduce paper results:

Section 5.2 (Testbed Integration): Run cd testbed && python analyze.py to generate testbed comparison plots from pre-computed results
Section 5.3 (SimAI Integration): Check pre-computed results in SimAI/results_gray_failures/ and run python SimAI/gray_failure_plot_results.py to generate paper figures
Sections 5.4-5.6 (m4 Evaluation): Run the notebook plot_results.ipynb to generate paper figures

Setup and Installation

Always activate the Python environment before running any commands:

uv sync
source .venv/bin/activate  # Activate the virtual environment!

Install Rust and Cargo:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install nightly
rustup default nightly

Install gcc-9:
```
sudo apt-get install gcc-9 g++-9
```

Set up the two simulators used for data generation: ns-3 (produces the training dataset with packet traces) and UNISON (for fast simulation):

cd High-Precision-Congestion-Control/UNISON-for-ns-3
./configure.sh
./ns3 run 'scratch/third mix/config_test.txt'
cd ../ns-3.39
./configure.sh
./ns3 run 'scratch/third mix/config_test.txt'

Running Experiments from Scratch

This section shows how to reproduce the experimental results from the paper using pre-trained models. The pre-trained checkpoints are available in the checkpoints/ directory.

Section 5.2: Testbed Integration

The testbed/ directory contains an integrated evaluation framework comparing three network simulation backends (m4, FlowSim, NS3) against real hardware measurements from a 12-node testbed running HERD, a key-value store application.

Build Backends

Build all three backends (requires GCC-9; the m4 backend also requires CUDA):

cd testbed

# Build all backends
./build.sh all

# Or build individual backends
./build.sh m4       # M4 ML-based simulator (requires CUDA)
./build.sh flowsim  # FlowSim flow-level simulator
./build.sh ns3      # NS3 packet-level simulator (UNISON)

Run Simulations

Run simulations using the pre-existing testbed ground truth data:

# Run all backends (recommended)
python run.py all

# Or run individual backends
python run.py m4       # M4 ML-based simulator
python run.py flowsim  # FlowSim flow-level simulator
python run.py ns3      # NS3 packet-level simulator

# Use --process-only to skip simulation and only process existing results
python run.py all --process-only

Test Scenarios: 24 scenarios covering RDMA sizes (250KB-1000KB) × window sizes (1, 2, 4)

Results are saved in:

eval_test/testbed/ — Real hardware ground truth (24 scenarios)
eval_test/m4/ — M4 simulation outputs
eval_test/flowsim/ — FlowSim simulation outputs
eval_test/ns3/ — NS3 simulation outputs

Analyze Results

Generate evaluation plots and accuracy summaries:

python analyze.py

This produces:

results/m4-testbed-perflow.png — Per-flow FCT error CDF
results/m4-testbed-overall-window2.png — Application completion time comparison
results/accuracy_summary.txt — Summary statistics

Evaluation Metrics:

Per-flow FCT error: Absolute relative error for individual UD and RDMA flows
Application completion time error: End-to-end execution time accuracy
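
The two metrics above can be sketched in a few lines. This is only an illustration of the usual definition of absolute relative error, |simulated - measured| / measured; the exact computation inside analyze.py is not shown here, and the function name and all numeric values below are hypothetical.

```python
from statistics import median


def abs_rel_error(predicted: float, actual: float) -> float:
    """Absolute relative error: |predicted - actual| / actual."""
    if actual <= 0:
        raise ValueError("ground-truth value must be positive")
    return abs(predicted - actual) / actual


# Hypothetical flow completion times in microseconds (illustrative only,
# not taken from the testbed data).
testbed_fct = [100.0, 250.0, 400.0, 800.0]
simulated_fct = [110.0, 225.0, 400.0, 1000.0]

# Per-flow FCT error: one value per flow, typically summarized as a CDF.
per_flow_errors = [
    abs_rel_error(sim, real) for sim, real in zip(simulated_fct, testbed_fct)
]
median_error = median(per_flow_errors)

# Application completion time error: a single end-to-end comparison
# (again with made-up durations).
app_error = abs_rel_error(predicted=2100.0, actual=2000.0)

print(f"median per-flow error: {median_error:.1%}")
print(f"application completion time error: {app_error:.1%}")
```

Per-flow errors describe how well individual flows are modeled, while the application-level error shows whether those per-flow deviations accumulate or cancel over a full run.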

Section 5.3: SimAI Integration Experiments

The SimAI/ directory contains an integrated evaluation framework with three network simulation backends: UNISON (ns-3), flowSim, and m4.

Build Backends

Build all three backends (requires GCC-9):

cd SimAI
./scripts/build.sh -c ns3      # Build UNISON (ns-3) backend
./scripts/build.sh -c flowsim  # Build flowSim backend
./scripts/build.sh -c m4       # Build m4 backend (requires CUDA)

Gray Failure Evaluation

We evaluate all three backends under gray failure conditions—scenarios where network components experience partial performance degradation rather than complete failures. This mimics real-world datacenter issues like cable aging, thermal throttling, or partial switch failures.

Gray Failure Topologies:

The repository includes 105 pre-generated topologies in example/gray_failures/ covering a comprehensive parameter sweep:

N ∈ {2, 3, ..., 16}: Number of degraded GPUs (6%-50% of 32-GPU cluster)
R ∈ {4, 5, ..., 10}: Bandwidth reduction factor (degraded links operate at 1/R capacity, i.e., 75%-90% bandwidth loss)
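
As a quick sanity check on the sweep size and the percentages quoted above, the parameter grid can be enumerated directly. The sketch below is not part of the repository; the file and directory name patterns follow the repository structure listing, and the assumption that N and R appear without zero-padding is ours.

```python
from itertools import product

NUM_GPUS = 32                      # cluster size stated above
DEGRADED_GPUS = range(2, 17)       # N = 2, ..., 16
REDUCTION_FACTORS = range(4, 11)   # R = 4, ..., 10
BACKENDS = ("ns3", "flowsim", "m4")


def bandwidth_loss(r: int) -> float:
    """Fraction of bandwidth lost when a link runs at 1/R capacity."""
    return 1.0 - 1.0 / r


scenarios = list(product(DEGRADED_GPUS, REDUCTION_FACTORS))

# Assumed naming (no zero-padding), following the patterns in the listing.
topology_files = [f"gray_topo_N{n}_R{r}.txt" for n, r in scenarios]
result_dirs = [f"n_{n}_r_{r}_{b}" for n, r in scenarios for b in BACKENDS]

print(len(scenarios), "topologies,", len(result_dirs), "simulations")
print(f"degraded GPUs: {2 / NUM_GPUS:.2%} to {16 / NUM_GPUS:.2%} of the cluster")
print(f"bandwidth loss: {bandwidth_loss(4):.0%} to {bandwidth_loss(10):.0%}")
```

The grid has 15 × 7 = 105 topologies, and running each on three backends gives the 315 simulations mentioned in this section.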

Run Gray Failure Sweep:

Note: Pre-computed results for all 315 simulations (3 backends × 105 scenarios) are available in results_gray_failures/. Running the sweep script will overwrite the pre-computed results.

# Run all scenarios for a specific backend
python gray_failure_run_sweep.py ns3      # UNISON (packet-level ground truth)
python gray_failure_run_sweep.py flowsim  # flowSim (analytical)
python gray_failure_run_sweep.py m4       # m4 (ML-based, uses GPU auto-detection)

# Run a single scenario (N=8 degraded GPUs, R=4 bandwidth reduction)
python gray_failure_run_sweep.py m4 --n 8 --r 4
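
To run a subset of the sweep, for example every reduction factor for a fixed N, the single-scenario form can be scripted. The helper below is a hypothetical convenience wrapper, not something the repository provides. It only builds the argument lists using the backend names and the `--n`/`--r` flags shown above, and it assumes the two flags are always passed together.

```python
from typing import List, Optional

BACKENDS = ("ns3", "flowsim", "m4")


def sweep_command(
    backend: str, n: Optional[int] = None, r: Optional[int] = None
) -> List[str]:
    """Build the argument list for one sweep invocation (nothing is executed)."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    if (n is None) != (r is None):
        # Assumption: --n and --r are given together, as in the example above.
        raise ValueError("pass both n and r, or neither")
    cmd = ["python", "gray_failure_run_sweep.py", backend]
    if n is not None:
        cmd += ["--n", str(n), "--r", str(r)]
    return cmd


# All reduction factors for N=8 on the m4 backend.
commands = [sweep_command("m4", n=8, r=r) for r in range(4, 11)]
for cmd in commands:
    print(" ".join(cmd))
    # To execute from the SimAI/ directory (this overwrites existing results):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to review what would be overwritten before anything runs.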

Visualize Results:

Generate all evaluation plots (CDFs, runtime comparison, MAE analysis, scatter plots):

python gray_failure_plot_results.py

This produces 6 figures in the SimAI/ directory:

gray_failure_errors.png — CDF of error magnitudes
gray_failure_signed_errors.png — CDF of signed errors (showing bias)
gray_failure_runtimes.png — Runtime comparison across backends
gray_failure_mae_by_n.png — Mean error vs. number of degraded GPUs
gray_failure_mae_by_r.png — Mean error vs. bandwidth reduction factor
gray_failure_scatter_n8.png — Completion time analysis for N=8
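
The difference between error magnitudes, signed errors, and the per-N or per-R means can be illustrated with a small aggregation sketch. It does not reproduce gray_failure_plot_results.py; the record layout, the helper name, and every number below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def mean_abs_error_by(records: List[dict], key: str) -> Dict[int, float]:
    """Group records by `key` and average the magnitude of the relative error."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(abs(rec["rel_error"]))
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}


# Hypothetical signed relative errors of one backend against the ns-3 baseline
# (negative = underestimate, positive = overestimate).
records = [
    {"n": 2, "r": 4, "rel_error": -0.02},
    {"n": 2, "r": 10, "rel_error": 0.06},
    {"n": 8, "r": 4, "rel_error": 0.10},
    {"n": 8, "r": 10, "rel_error": -0.20},
]

mae_by_n = mean_abs_error_by(records, "n")   # mean error vs. degraded GPUs
mae_by_r = mean_abs_error_by(records, "r")   # mean error vs. reduction factor

# The signed mean keeps the sign, so it exposes systematic bias that the
# magnitude-based averages hide.
mean_signed = sum(rec["rel_error"] for rec in records) / len(records)

print(mae_by_n, mae_by_r, f"{mean_signed:+.3f}")
```

A backend can have a small signed mean and still a large mean absolute error when over- and underestimates cancel, which is why both views are plotted.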

Visualize Network Topology:

Generate a visualization of the 32-GPU datacenter topology structure:

python gray_failure_topo_viz.py

This produces simai_topo_groups.png showing the hierarchical network topology with NVSwitch and rail switch layers.

Sections 5.4-5.6: m4 Evaluation Experiments

Reproduce m4's accuracy evaluation across diverse network scenarios using pre-trained models.

Note: instructions to build and run flowSim and m4 for these experiments have not been added yet.

Note: We provide 5 pre-generated ns-3 scenarios in the parsimon-eval/expts/fig_8/eval_test_demo directory. To run the full evaluation, use the following commands:

For Section 5.4 (Large-scale evaluation):

cd parsimon-eval/expts/fig_7
cargo run --release -- --root=./data --mixes spec/eval_test.mix.json ns3
cargo run --release -- --root=./data --mixes spec/eval_test.mix_large.json ns3
cargo run --release -- --root=./data --mixes spec/eval_test.mix.json mlsys
cargo run --release -- --root=./data --mixes spec/eval_test.mix_large.json mlsys

Results will be saved in the data directory.

For Section 5.5 (Flow-level evaluation):

cd parsimon-eval/expts/fig_8
cargo run --release -- --root=./eval_test --mixes spec/eval_test.mix.json --nr-flows 20000 ns3
cargo run --release -- --root=./eval_test --mixes spec/eval_test.mix.json --nr-flows 20000 mlsys

Results will be saved in the eval_test directory.

For Appendix 1 (Application completion time):

cd parsimon-eval/expts/fig_8
cargo run --release -- --root=./eval_app --mixes spec/eval_app.mix.json --nr-flows 20000 --enable-app ns3
cargo run --release -- --root=./eval_app --mixes spec/eval_app.mix.json --nr-flows 20000 --enable-app mlsys

Results will be saved in the eval_app directory.

Visualize Results

After completing the data generation and inference steps above, run the notebook plot_results.ipynb to generate the paper figures.

Training Your Own Model

This section shows how to train and test your own m4 model from scratch. Follow these steps in order:

Step 1: Prepare Training Data

Option A: Use Demo Data (Recommended for Quick Start)

We provide pre-generated demo training data in the parsimon-eval/expts/fig_8/eval_train_demo directory.

Option B: Generate Full Training Dataset

Or you can generate the complete training dataset yourself:

cd parsimon-eval/expts/fig_8
cargo run --release -- --root={dir_to_data} --mixes={config_for_sim_scenarios} --enable-train ns3

Example:

cargo run --release -- --root=./eval_train --mixes spec/eval_train.mix.json --nr-flows 2000 --enable-train ns3

Step 2: Train the Model

Train the neural network using the generated or demo training data:

Ensure you are in the correct Python environment.
Modify config/train_config.yaml if needed.

Run:

cd m4
uv run python main_train.py --train_config={path_to_config_file} --mode=train --dir_input={dir_to_save_data} --dir_output={dir_to_save_ckpts} --note={note}

Example:

# train on demo data
uv run python main_train.py
# train on the simulation data used in the paper
uv run python main_train.py --train_config=./config/train_config.yaml --mode=train --dir_input=./parsimon-eval/expts/fig_8/eval_train --dir_output=./results_train --note m4

Note: You can also use TensorBoard to visualize the training process:

uv run tensorboard --logdir ./results_train/ --port 8009 --bind_all

Then open TensorBoard in your browser at the address printed in the terminal.

Step 3: Test the Model

Validate your trained model using the training data to check performance:

Ensure you are in the correct Python environment.
Modify config/test_config.yaml if needed.

Run:

cd m4
uv run python main_train.py --mode=test --test_config={path_to_config_file} --dir_input={dir_to_save_data} --dir_output={dir_to_save_results} --note={note}

Example:

# test on the demo data
uv run python main_train.py --mode=test
# validate on the simulation data used in the paper
uv run python main_train.py --mode=test --test_config=./config/test_config.yaml --dir_input=./parsimon-eval/expts/fig_8/eval_train --dir_output=./results_train --note m4

Citation

If you find our work useful, please cite our paper:

@inproceedings{m4,
    author = {Li, Chenning and Zabreyko, Anton and Nasr-Esfahany, Arash and Zhao, Kevin and Goyal, Prateesh and Alizadeh, Mohammad and Anderson, Thomas},
    title = {m4: A Learned Flow-level Network Simulator},
    year = {2025},
}

Acknowledgments

We extend special thanks to Kevin Zhao and Thomas Anderson for their insights in the NSDI'23 paper Scalable Tail Latency Estimation for Data Center Networks. Their source code is available in Parsimon.

Contact

For further inquiries, reach out to Chenning Li at:
📧 [email protected]
