-
Notifications
You must be signed in to change notification settings - Fork 9
Description
I have a pretty large (>4 GB) esgpull database (esgpull.db) that contains data pulled in from an old synda database long ago (using the synda db migration tools) as well as data/queries unique to esgpull itself.
When running a query for CMIP6 data, I reached the esgpull update step and then encountered a failure that I've never seen before.
Traceback (most recent call last):
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1936, in _exec_single_context
self.dialect.do_executemany(
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 921, in do_executemany
cursor.executemany(statement, parameters)
sqlite3.IntegrityError: UNIQUE constraint failed: file.file_id
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/esgpull/tui.py", line 180, in logging
yield
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/esgpull/cli/update.py", line 184, in update
with esg.db.commit_context():
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/esgpull/database.py", line 96, in commit_context
self.session.commit()
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2017, in commit
trans.commit(_to_root=True)
File "<string>", line 2, in commit
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
ret_value = fn(self, *arg, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1302, in commit
self._prepare_impl()
File "<string>", line 2, in _prepare_impl
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
ret_value = fn(self, *arg, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1277, in _prepare_impl
self.session.flush()
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 4341, in flush
self._flush(objects)
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 4476, in _flush
with util.safe_reraise():
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
raise exc_value.with_traceback(exc_tb)
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 4437, in _flush
flush_context.execute()
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/unitofwork.py", line 466, in execute
rec.execute(self)
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/unitofwork.py", line 642, in execute
util.preloaded.orm_persistence.save_obj(
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/persistence.py", line 93, in save_obj
_emit_insert_statements(
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/orm/persistence.py", line 1048, in _emit_insert_statements
result = connection.execute(
^^^^^^^^^^^^^^^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
return meth(
^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
return connection._execute_clauseelement(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
ret = self._execute_context(
^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
return self._exec_single_context(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
self._handle_dbapi_exception(
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1936, in _exec_single_context
self.dialect.do_executemany(
File "/path/to/miniconda3/envs/esgdownload/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 921, in do_executemany
cursor.executemany(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: file.file_id
In my investigation, I looked in update.py, and discovered that the unregistered_files variable, a list of File objects set inside with esg.db.commit_context()contained a small number of File objects with a duplicated file_id field. Since the code sets the file_id field as UNIQUE for the database insertion, this triggers the above traceback regarding database integrity. The insertion step within the database session does not contain an SQL parameter to ignore duplicated entries, which is probably set by design, yet at the same time got the error in question to trigger. The tricky part is, the File objects themselves may indeed be unique, even though the same File.file_id was shared across the object in limited quantity.
I crudely hacked the script to remove the File objects in the list with the same File.file_id
raw_unregistered_files = [f for f in qf.files if f not in esg.db]
ruf_ids = []
dup_ids = []
for ruf in raw_unregistered_files:
ruf_ids.append(ruf.file_id)
for ruf_id in ruf_ids:
if ruf_ids.count(ruf_id) > 1:
dup_ids.append(ruf_id)
unregistered_files = [f for f in raw_unregistered_files if f.file_id not in dup_ids]
though I'm sure there's a better way to do it.