A high-performance Julia implementation of the Single-Source Shortest Path (SSSP) algorithm introduced in:
Ran Duan, Jiayi Mao, Xiao Mao, Xinkai Shu, Longhui Yin,
"Breaking the Sorting Barrier for Directed Single-Source Shortest Paths", STOC 2025.
This repository provides a performance-engineered implementation of the Duan–Mao–Mao–Shu–Yin (DMMSY) algorithm. By leveraging advanced architectural optimizations such as 8-way Instruction-Level Parallelism (ILP), hybrid priority tracking, and cache-interleaved memory layouts, this implementation demonstrates significant practical advantages over traditional heap-based Dijkstra's algorithm, achieving over 200x speedup on large-scale graphs when using high-performance compiler flags.
The DMMSY algorithm represents the first deterministic breakthrough to exceed the
-
Complexity Shift: It reduces the logarithmic factor from
$O(\log n)$ to$O(\log^{2/3} n)$ . For a graph with$10^9$ nodes, this is a ~3x reduction in sorting overhead. - Interval-Pivot Mechanism: Unlike Dijkstra which sorts nodes individually, DMMSY uses a recursive subproblem decomposition based on interval pivots. This allows it to relax edges without maintaining a strict global priority queue for every extraction.
Traditional production engines often use Contraction Hierarchies (CH) or A* to achieve sub-millisecond queries. However, these rely on expensive preprocessing (
DMMSY is the superior choice for Dynamic Graphs where:
- Topology/Weights Change Rapidly: The costs of rebuilding a hierarchy (20-40 minutes for road networks) cannot be amortized.
- Zero Preprocessing luxury: Real-time systems like Financial Market Arbitrage or Software-Defined Networking (SDN) require immediate SSSP computation on a raw, mutating graph.
-
Theoretical Breakthrough: Implements the first deterministic algorithm to break the
$O(m \log n)$ sorting barrier for directed graphs, achieving$O(m \log^{2/3} n)$ complexity. - Hardware-Aware Design: Features unrolled priority tracking and software prefetching to maximize pipeline occupancy.
-
Selective Memory Reset: Utilizes dirty-index tracking to eliminate
$O(n)$ overhead during recursive subproblem resets. - Adaptive Strategy: Employs a hybrid tracking system (Bitmap + Heap) and dynamic threshold feedback to maintain peak efficiency.
- Robust Verification: Validated against reference Dijkstra implementations across linear, diamond, cycle, and hub topologies.
The algorithm's ability to handle massive sparse graphs without preprocessing creates a new paradigm for specific industrial domains:
| Sector | Application | DMMSY Advantage |
|---|---|---|
| Finance | Arbitrage Detection | Faster negative-cycle detection in violently dynamic FX/Crypto markets. |
| Networking | SDN & OSPF | Reduced path convergence time in datacenter fabrics during link failures. |
| Logistics | Live Traffic Routing | Faster fallback SSSP when emergency closures invalidate precomputed hierarchies. |
| EDA | Chip Design | Iterative timing analysis on circuit DAGs with millions of gate nodes. |
| Social Tech | Influence Analysis | Real-time recommendation ranking on billion-node sparse social graphs. |
Tests conducted on a modern x86_64 architecture with specialized selective reset and GC-managed timing. The results below showcase the stable performance delta under --math-mode=fast.
Note
The algorithm achieves its maximum performance delta at scale (250k–1M+ nodes), where the reduction in sorting overhead and enhanced memory locality most effectively mitigate the cache-bottlenecks of standard priority queues. At 1M nodes, DMMSY outperforms Dijkstra by over 200x.
-
Dual-Level Interleaving:
-
Graph Layer: Neighbor IDs and weights are co-located in an
Edgestruct to ensure single-cache-line fetches during relaxation. -
Heap Layer: Distance values and node IDs are co-located in a
HeapNodestruct within the 4-ary heap, minimizing cache thrashing during swaps.
-
Graph Layer: Neighbor IDs and weights are co-located in an
- 8-way ILP Unrolling: Critical relaxation loops are manually unrolled to saturate execution units and remove data dependency chains.
-
Hybrid Block Search: Swaps between
$O(1)$ bitmap scanning and$O(\log n)$ heap-based selection based on subproblem characteristics. - Adaptive Thresholding: Dynamically adjusts recursive propagation thresholds based on real-time extraction efficiency.
-
Quickselect-based Pivots: Utilizes
$O(n)$ partial sorting for pivot selection instead of full$O(n \log n)$ sorts.
DMMSY-SSSP.jl: Core highly-optimized implementation.CSRGraph.jl: High-performance Compressed Sparse Row (CSR) graph implementation.Dijkstra.jl: Reference implementation used for correctness and performance baselines.DMMSY_Research.jl: Baseline research implementation for algorithmic comparison.bench_report_collector.jl: Main high-precision benchmarking and reporting driver.benchmark_results.html: Visualization for performance data.benchmark_data.csv: Raw timing metrics.
Ensure you have Julia installed, then clone this repository and include the source files:
include("Common.jl")
include("CSRGraph.jl")
include("Dijkstra.jl")
include("DMMSY-SSSP.jl")
using .Common, .CSRGraphModule, .DijkstraModule, .DMMSYSSSP
# Generate a synthetic graph
g = random_graph(10000, 50000, 100.0)
# Compute shortest paths from source vertex 1
dists, preds = ssp_duan(g, 1)To verify the implementation against reference Dijkstra across multiple topologies:
julia test_sssp.jlTo generate a full performance report:
julia bench_report_collector.jlFor maximum performance, it is highly recommended to run Julia with the --math-mode=fast flag. This enables aggressive floating-point optimizations, SIMD vectorization, and reordering of operations that are critical for the edge relaxation loops.
# Standard Execution
julia bench_report_collector.jl
# High-Performance Mode (Recommended for Benchmarking)
julia --math-mode=fast bench_report_collector.jlImportant
Enabling --math-mode=fast can increase speedups from 1.1x to over 220x on specific graph topologies by allowing the compiler to fully utilize the processor's vector units and instruction pipelines.
We welcome contributions to further improve the performance of this implementation. Potential areas for exploration include:
- Multithreading: Parallelizing the recursive subproblem decomposition.
- SIMD: Leveraging AVX-512/AMX instructions for even more aggressive relaxation unrolling.
- GPU Acceleration: Offloading the blocked priority queue logic to massive parallel hardware.
- Duan, R., Mao, J., Mao, X., Shu, X., & Yin, L. (2025). "Breaking the Sorting Barrier for Directed Single-Source Shortest Paths". Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC 2025).
- arXiv:2504.17033
- ACM Digital Library
- MPI-INF repository
This project is dual-licensed under the MIT License and Apache License 2.0.
