Skip to content

[WIP] Speedup CommutativeOptimization pass#15504

Open
alexanderivrii wants to merge 1 commit intoQiskit:mainfrom
alexanderivrii:speedup-commutative-optimization
Open

[WIP] Speedup CommutativeOptimization pass#15504
alexanderivrii wants to merge 1 commit intoQiskit:mainfrom
alexanderivrii:speedup-commutative-optimization

Conversation

@alexanderivrii
Copy link
Copy Markdown
Member

Summary

By experimenting with various larger circuits with 100-2000 qubits (including QFT, QAOA, Trotterized Hamiltonian evolution, etc.) in an effort to replace CommutativeCancellation by CommutativeOptimization in the transpiler pipeline (see #15464), I saw that CommutativeOptimization is about 2x faster than CommutativeCancellation, and yet is still considerably slow. There seem to be two main bottlenecks in CommutativeOptimization: (1) the commutation checker can be quite slow and (2) the pass does a humongous number of trivial commutation checks where the gates have disjoint supports. This is something that we were aware about but so far did not find a good way to address this problem.

The current PR attempts to improve the runtime by computing a u64 for each node's qubits/clbits, and in particular when the two masks are disjoint (that is, the bit-and mask1 & mask2 = 0), the nodes have necessarily disjoint supports and hence trivially commute. This catches the majority (though not all) trivial commutation checks and improves the runtimes about 2x-4x times, and actually the improvement gets larger on larger circuits).

Here is some quick profiling data on my laptop:

multiplier with 216 qubits: main = 1.76 s, this PR = 0.93s,
multiplier with 512 qubits: main = 41.08 s, this PR = 14.39s,
QFT with 216 qubits: main = 1.83 s, this PR = 0.60s,
QFT with 512 qubits: main = 62.7 s, this PR = 14.19s,

I am keeping this WIP for now to see if someone can suggest a better approach that does not require introducing masks.

@alexanderivrii alexanderivrii requested a review from a team as a code owner January 5, 2026 10:54
@qiskit-bot
Copy link
Copy Markdown
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

@coveralls
Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 20713139470

Details

  • 25 of 25 (100.0%) changed or added relevant lines in 1 file are covered.
  • 16 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.004%) to 88.32%

Files with Coverage Reduction New Missed Lines %
crates/qasm2/src/expr.rs 1 93.82%
crates/transpiler/src/passes/commutative_optimization.rs 1 96.28%
crates/qasm2/src/lex.rs 5 91.77%
crates/circuit/src/parameter/symbol_expr.rs 9 72.94%
Totals Coverage Status
Change from base Build 20695381143: 0.004%
Covered Lines: 96747
Relevant Lines: 109542

💛 - Coveralls

Comment on lines +386 to +391
/// Computes a `u64` mask for a given node's qubits and clbits.
///
/// If the circuit has both qubits and clbits, the mask has
/// 32 low bits for qubits and 32 high bits for clbits.
/// When the circuit has no clbits, all of the 64 bits are used
/// for qubits.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... could we get away with using a smaller u32 for the masks here as they represent qargs/cargs.

Also, what happens if the circuit has more than 64 bits total. I.e. if any of the indices exceeds 64? I'm assuming the value would overflow making your mask an invalid representation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question out of curiosity, would it make sense to add the first instruction’s qargs/cargs to a set, then check whether any of the second instruction’s qargs/cargs are already in it? Or would that be too inefficient?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raynelfss, we map the index of a qubit to index % 64, so there is no overflow (implemented as index & 63 since this is a bit faster than taking modulus). Such masks are quite common in SAT-solving (for fun, see e.g. section 4.2 in https://cca.informatik.uni-freiburg.de/papers/EenBiere-SAT05.pdf).

Reducing the size of the mask leads to more false negatives: let's say that the circuit has no classical bits; then CX(1, 10) is found disjoint from CX(42, 43) if mask of size 64 is used but not if mask of size 32 is used. Experimentally, 64 is better on the benchmarks I tried.

If the circuit has both qubits and clbits, I am only using 32 bits for each. Possibly I could give more bit-width to qubits (e.g. 48 bits to qubits and 16 bits to clbit), but I have not tried that.

@Shobhit21287, I actually tried similar things: (1) Storing for each node the sorted lists of its qubits, which then enables checking whether the two sorted lists are disjoint in time linear in the sum of the two lists. This did not help since in my examples most of the gates had very few qubits/clbits. (2) Storing fixedbitset (that was absolutely horrible on large benchmarks).

At some point @jakelishman mentioned that by cleverly iterating over the DAG it might be possible to avoid considering gates with disjoint supports whatsoever, but I don't see this.

We also have some code in commutation_analysis.rs which however works only for 1 and 2 qubit gates, and requires computing additional (often redundant) information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants