
customsolve.py: incorrect number of arguments to _find_pivot. #542

@rileyjmurray

Description


Here's a code snippet in pygsti/optimize/customsolve.py on the head of master.

smbuf1b = _np.zeros(1, _np.int64)
smbuf2 = _np.zeros(2, _np.int64)
smbuf3 = _np.zeros(3, _np.int64)
for icol in range(a.shape[1]):
    # Step 1: find the index of the row that is the best pivot.
    # each proc looks for its best pivot (Note: it should not consider rows already pivoted on)
    potential_pivot_indices = all_row_indices[potential_pivot_mask]
    ibest_global, ibest_local, h, k = _find_pivot(a, b, icol, potential_pivot_indices, my_row_slice,
                                                  shared_floats, shared_ints, resource_alloc, comm, host_comm,
                                                  smbuf1, smbuf2, smbuf3, host_index_buf, host_val_buf)

The smbuf1b variable is never used. It looks like it should be passed to _find_pivot right after smbuf1, per the signature below (the corresponding parameters are named buf1 and buf1b):

def _find_pivot(a, b, icol, potential_pivot_inds, my_row_slice, shared_floats, shared_ints,
                resource_alloc, comm, host_comm, buf1, buf1b, buf2, buf3, best_host_indices, best_host_vals):

Because smbuf1b is never passed along, _find_pivot ends up being called with only 15 positional arguments when it requires 16, so reaching this line raises a TypeError.
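
For concreteness, here is a sketch of the corresponding one-line fix (untested; it assumes smbuf1 is allocated elsewhere in the function, as the existing call already implies):

ibest_global, ibest_local, h, k = _find_pivot(a, b, icol, potential_pivot_indices, my_row_slice,
                                              shared_floats, shared_ints, resource_alloc, comm, host_comm,
                                              smbuf1, smbuf1b, smbuf2, smbuf3,  # smbuf1b now fills the buf1b slot
                                              host_index_buf, host_val_buf)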

This bug rarely surfaces because there's a dedicated code path (below) for linear solves with matrices of order < 10,000, and that path gets triggered in our testing and even in most large GST runs:

if comm.size < proc_threshold and a.shape[1] < 10000:
    # We're not exactly sure where scipy is better, but until we speed up / change gaussian-elim
    # alg the scipy alg is much faster for small numbers of procs and so should be used unless
    # A is too large to be gathered to the root proc.
    global_a, a_shm = ari.gather_jtj(a, return_shared=True)
    global_b, b_shm = ari.gather_jtf(b, return_shared=True)
    #global_a = ari.gather_jtj(a)
    #global_b = ari.gather_jtf(b)
    if comm.rank == 0:
        try:
            global_x = _scipy.linalg.solve(global_a, global_b, assume_a='pos')
            ok_buf[0] = 1  # ok
        except _scipy.linalg.LinAlgError as e:
            ok_buf[0] = 0  # failure!
            err = e
    else:
        global_x = None
        err = _scipy.linalg.LinAlgError("Linear solver fail on root proc!")  # just in case...
    comm.Bcast(ok_buf, root=0)
    if ok_buf[0] == 0:
        _smt.cleanup_shared_ndarray(a_shm)
        _smt.cleanup_shared_ndarray(b_shm)
        raise err  # all procs must raise in sync
    ari.scatter_x(global_x, x)
    _smt.cleanup_shared_ndarray(a_shm)
    _smt.cleanup_shared_ndarray(b_shm)
    return
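
When that scipy path is skipped, the un-patched call fails immediately with a TypeError. A toy stand-in (not the real pyGSTi function) reproduces the arity mismatch:

def _find_pivot_toy(a, b, icol, buf1, buf1b, buf2, buf3):
    return icol  # body is irrelevant here; only the argument count matters

try:
    # Mirrors the real bug: buf1b is omitted, the later buffers shift left one
    # slot, and Python reports the *last* parameter as the missing one.
    _find_pivot_toy(None, None, 0, [0], [0, 0], [0, 0, 0])
except TypeError as err:
    print(err)  # _find_pivot_toy() missing 1 required positional argument: 'buf3'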

Possible ways to move forward:

  1. Temporary. Just change the dimension threshold from 10,000 to 100,000. Storing one double-precision matrix of that size takes 100,000² × 8 bytes ≈ 80 GB, and since the scipy call does not overwrite its input by default, running it will need roughly 160 GB of "free" RAM. That's a reasonable amount for modern compute servers.
  2. Fix the bug in the custom Gaussian elimination code, and have the matrix-size branch check a user-modifiable constant instead of a hard-coded value, so that our tests can exercise this code path with a small threshold (see the sketch after this list).
  3. Ambitious. Create a new ArrayInterface class backed by Dask arrays and replace the custom Gaussian elimination code with calls to Dask's distributed linear algebra routines: https://docs.dask.org/en/stable/_modules/dask/array/linalg.html.
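
A rough sketch of the threshold part of option 2, with hypothetical names (neither constant exists in pyGSTi today):

# Hypothetical module-level knobs in pygsti/optimize/customsolve.py.
CUSTOM_SOLVE_PROC_THRESHOLD = 100      # stand-in for proc_threshold; value here is arbitrary
CUSTOM_SOLVE_SIZE_THRESHOLD = 10_000   # replaces the hard-coded 10000

def _use_gathered_scipy_path(comm_size, ncols):
    # Replacement for the hard-coded `comm.size < proc_threshold and a.shape[1] < 10000` test.
    return comm_size < CUSTOM_SOLVE_PROC_THRESHOLD and ncols < CUSTOM_SOLVE_SIZE_THRESHOLD

A test could then set customsolve.CUSTOM_SOLVE_SIZE_THRESHOLD to something tiny before calling the solver, so the custom elimination code runs on small matrices and the missing-argument bug gets caught by CI.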
