[WIP] Speedup CommutativeOptimization pass #15504
alexanderivrii wants to merge 1 commit into Qiskit:main
Conversation
/// Computes a `u64` mask for a given node's qubits and clbits.
///
/// If the circuit has both qubits and clbits, the mask has
/// 32 low bits for qubits and 32 high bits for clbits.
/// When the circuit has no clbits, all of the 64 bits are used
/// for qubits.
Hmm... could we get away with using a smaller u32 for the masks here, since they represent qargs/cargs?
Also, what happens if the circuit has more than 64 bits total, i.e. if any of the indices exceeds 64? I'm assuming the value would overflow, making your mask an invalid representation.
A general question out of curiosity, would it make sense to add the first instruction’s qargs/cargs to a set, then check whether any of the second instruction’s qargs/cargs are already in it? Or would that be too inefficient?
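For concreteness, the set-based idea could look something like this (an illustrative sketch only, not code from the PR; `disjoint_via_set` is a hypothetical name):

```rust
use std::collections::HashSet;

// Hypothetical sketch: collect the first node's qubit/clbit indices into a
// HashSet, then probe it with the second node's indices.
fn disjoint_via_set(first: &[u32], second: &[u32]) -> bool {
    let seen: HashSet<u32> = first.iter().copied().collect();
    // Disjoint iff no index of the second node was already seen.
    second.iter().all(|q| !seen.contains(q))
}

fn main() {
    // CX(1, 10) and CX(42, 43) act on disjoint qubits.
    assert!(disjoint_via_set(&[1, 10], &[42, 43]));
    // CX(1, 10) and CX(10, 11) share qubit 10.
    assert!(!disjoint_via_set(&[1, 10], &[10, 11]));
    println!("ok");
}
```

This is exact (no false negatives), but each check pays for hashing and a heap allocation, which is the efficiency concern raised above.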
@raynelfss, we map the index of a qubit to index % 64, so there is no overflow (implemented as index & 63, since this is a bit faster than taking the modulus). Such masks are quite common in SAT solving (for fun, see e.g. Section 4.2 in https://cca.informatik.uni-freiburg.de/papers/EenBiere-SAT05.pdf).
Reducing the size of the mask leads to more false negatives: let's say the circuit has no classical bits; then CX(1, 10) is found disjoint from CX(42, 43) if a mask of size 64 is used, but not if a mask of size 32 is used. Experimentally, 64 is better on the benchmarks I tried.
If the circuit has both qubits and clbits, I am only using 32 bits for each. Possibly I could give more bit-width to qubits (e.g. 48 bits to qubits and 16 bits to clbits), but I have not tried that.
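The masking scheme described above can be sketched roughly as follows (a hypothetical sketch; `node_mask` and its signature are illustrative, not the PR's actual code):

```rust
// Illustrative sketch of the mask computation described above.
// With clbits present, qubit i sets bit i & 31 in the low half and clbit j
// sets bit 32 + (j & 31) in the high half; without clbits, all 64 bits go
// to qubits via i & 63 (the cheap stand-in for i % 64).
fn node_mask(qubits: &[usize], clbits: &[usize], circuit_has_clbits: bool) -> u64 {
    let mut mask = 0u64;
    if circuit_has_clbits {
        for &q in qubits {
            mask |= 1u64 << (q & 31); // low 32 bits for qubits
        }
        for &c in clbits {
            mask |= 1u64 << (32 + (c & 31)); // high 32 bits for clbits
        }
    } else {
        for &q in qubits {
            mask |= 1u64 << (q & 63); // full 64 bits for qubits
        }
    }
    mask
}

fn main() {
    // No clbits: CX(1, 10) vs CX(42, 43) are correctly found disjoint with
    // 64-bit masks; with 32-bit masks, 42 & 31 == 10 would collide with
    // qubit 10 and produce a false negative.
    let a = node_mask(&[1, 10], &[], false);
    let b = node_mask(&[42, 43], &[], false);
    assert_eq!(a & b, 0);
    println!("disjoint: {}", a & b == 0);
}
```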
@Shobhit21287, I actually tried similar things: (1) storing for each node the sorted list of its qubits, which enables checking whether two sorted lists are disjoint in time linear in the sum of their lengths. This did not help, since in my examples most of the gates had very few qubits/clbits. (2) Storing a fixedbitset per node (that was absolutely horrible on large benchmarks).
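Alternative (1), the linear-time walk over two sorted index lists, can be sketched like so (an illustrative sketch, not the PR's code; `sorted_disjoint` is a hypothetical name):

```rust
use std::cmp::Ordering;

// Illustrative sketch: a merge-style walk over two sorted index lists,
// disjoint in O(len(a) + len(b)) comparisons.
fn sorted_disjoint(a: &[u32], b: &[u32]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => return false, // shared qubit/clbit index
        }
    }
    true
}

fn main() {
    assert!(sorted_disjoint(&[1, 10], &[42, 43]));
    assert!(!sorted_disjoint(&[1, 10], &[10, 11]));
    println!("ok");
}
```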
At some point @jakelishman mentioned that by cleverly iterating over the DAG it might be possible to avoid considering gates with disjoint supports altogether, but I don't see how.
We also have some code in commutation_analysis.rs, which however works only for 1- and 2-qubit gates, and requires computing additional (often redundant) information.
Summary
By experimenting with various larger circuits with 100-2000 qubits (including QFT, QAOA, Trotterized Hamiltonian evolution, etc.) in an effort to replace `CommutativeCancellation` by `CommutativeOptimization` in the transpiler pipeline (see #15464), I saw that `CommutativeOptimization` is about 2x faster than `CommutativeCancellation`, and yet is still considerably slow. There seem to be two main bottlenecks in `CommutativeOptimization`: (1) the commutation checker can be quite slow, and (2) the pass does a humongous number of trivial commutation checks where the gates have disjoint supports. This is something that we were aware of but so far did not find a good way to address.
The current PR attempts to improve the runtime by computing a `u64` mask for each node's qubits/clbits; in particular, when the two masks are disjoint (that is, the bit-and `mask1 & mask2 == 0`), the nodes necessarily have disjoint supports and hence trivially commute. This catches the majority (though not all) of the trivial commutation checks and improves the runtimes by about 2x-4x, and the improvement gets larger on larger circuits. Here is some quick profiling data on my laptop:
I am keeping this WIP for now to see if someone can suggest a better approach that does not require introducing masks.