
Parallel parsing doesn't balance workloads well when some files are much larger than others #1808

@MikePopoloski

Description


Discussed in #1793

Originally posted by acarlson1029 April 16, 2026
Currently in SourceLoader.cpp there is the following code for parallelizing the parsing phase:

        // Load all source files that were specified on the command line
        // or via library maps.
        pool->detach_loop(size_t(0), fileEntries.size(), [&](size_t i) {
            loadResults[i] = loadAndParse(fileEntries[i], optionBag, srcOptions, i);
        });
        pool->wait();

From the definition of detach_loop:

Parallelize a loop by automatically splitting it into blocks and submitting each block separately to the queue, with the specified priority. The loop function takes one argument, the loop index, and it is called exactly once per index, but many times per block. Does not return a BS::multi_future, so the user must use wait() or some other method to ensure that the loop finishes executing, otherwise bad things will happen.

Current behavior

This splits the file list into fixed contiguous blocks, one block per thread. For large file lists containing large individual files, one thread was observed to become the long pole during parsing, leaving all of the other threads idle after they had finished their blocks.

See the trace generated as the baseline with a real file list. This run used 64 threads and took 45 seconds. Typical performance depends on how the files are distributed between the blocks: if one block happens to contain many large files (e.g. parsing generated code), it can take substantially longer than the rest.

slang_trace_baseline_sized.json

Updated behavior

This can be optimized by using detach_sequence instead.

Submit a sequence of tasks enumerated by indices to the queue, with the specified priority. The sequence function takes one argument, the task index, and will be called once per index. Does not return a BS::multi_future, so the user must use wait() or some other method to ensure that the sequence finishes executing, otherwise bad things will happen.

This lets each thread pull the next available task as soon as it finishes its current one, so no thread sits idle while work remains.

See the trace generated with this optimization enabled. This run also used 64 threads and took only 33 seconds.

slang_trace_optimized_sized.json

NOTE: I haven't benchmarked this with a small list of files to see if it causes a performance regression.
