Skip to content

Tree merge performance when reparenting causes lock timeouts and OOM #8029

@grantfitzsimmons

Description

@grantfitzsimmons

I have 3 Andrenidae nodes in the 'Hymenoptera (recent)' collection in a database named naturkundemuseum_test_2026_04_27 (available here).

@bronwyncombs did some research on this after there were multiple "Lock wait timeout exceeded" messages received:

Andrenidae (0,0) | ID 341182 | created 2015-08-13 | source catalog of life

49 child genus records (all with children)
query family > name, family>ID family>ts created and modified = 3042 results/children
Andrenidae (0,8602) | ID 887847 | created 2026-04-24 | source none

36 child genus records (23 with children)
same query structure = 370 results/children
Andrenidae (1398) | ID 887983 | created 2026-04-22 | source local_db

0 child nodes
same query structure = 1 result

merge 341182 into 887847 = fail 504 timeout
Merge 887847 into 341182 = fail 504 timeout
merge 341182 into 887983 = fail (fast; 500 operational error try starting transaction)
merge 887847 into 887983 = fail (slower; same operational message)
merge 887983 into 887847 = SUCCESS
merge 887847 into 341182? = fail
merge 341182 into 887847? = fail

Error:

{
  "type": "invalidResponseCode",
  "statusText": [
    "Invalid response code 500.",
    "\n    This error may indicate a misconfiguration or a bug in Specify. Please\n    double check your configuration and report this issue.\n  ",
    "Response:"
  ],
  "responseText": "{\"exception\": \"OperationalError\", \"message\": \"1205\", \"data\": \"Lock wait timeout exceeded; try restarting transaction\", \"traceback\": \"Traceback (most recent call last):\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/utils.py\\\", line 89, in _execute\\n    return self.cursor.execute(sql, params)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/mysql/base.py\\\", line 75, in execute\\n    return self.cursor.execute(query, args)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/MySQLdb/cursors.py\\\", line 206, in execute\\n    res = self._query(query)\\n          ^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/MySQLdb/cursors.py\\\", line 319, in _query\\n    db.query(q)\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/MySQLdb/connections.py\\\", line 254, in query\\n    _mysql.connection.query(self, query)\\nMySQLdb.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')\\n\\nThe above exception was the direct cause of the following exception:\\n\\nTraceback (most recent call last):\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/core/handlers/base.py\\\", line 197, in _get_response\\n    response = wrapped_callback(request, *callback_args, **callback_kwargs)\\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/specifyweb/specify/views.py\\\", line 32, in wrapped\\n    return view(request, *args, **kwargs)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/views/decorators/http.py\\\", line 43, in inner\\n    return func(request, *args, **kwargs)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/usr/lib/python3.12/contextlib.py\\\", line 81, in inner\\n    return func(*args, **kwds)\\n           ^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/specifyweb/backend/trees/views.py\\\", line 63, in wrapper\\n    mutation(*args, **kwargs)\\n  File \\\"/opt/specify7/specifyweb/backend/trees/views.py\\\", line 292, in merge\\n    extras.merge(node, target, request.specify_user_agent)\\n  File \\\"/opt/specify7/specifyweb/backend/trees/extras.py\\\", line 335, in merge\\n    target = model.objects.select_for_update().get(id=into.id)\\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/models/query.py\\\", line 635, in get\\n    num = len(clone)\\n          ^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/models/query.py\\\", line 382, in __len__\\n    self._fetch_all()\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/models/query.py\\\", line 1886, in _fetch_all\\n    self._result_cache = list(self._iterable_class(self))\\n                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/models/query.py\\\", line 93, in __iter__\\n    results = compiler.execute_sql(\\n              ^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/models/sql/compiler.py\\\", line 1562, in execute_sql\\n    cursor.execute(sql, params)\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/utils.py\\\", line 67, in execute\\n    return self._execute_with_wrappers(\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/utils.py\\\", line 80, in _execute_with_wrappers\\n    return executor(sql, params, many, context)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/utils.py\\\", line 84, in _execute\\n    with self.db.wrap_database_errors:\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/utils.py\\\", line 91, in __exit__\\n    raise dj_exc_value.with_traceback(traceback) from exc_value\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/utils.py\\\", line 89, in _execute\\n    return self.cursor.execute(sql, params)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/django/db/backends/mysql/base.py\\\", line 75, in execute\\n    return self.cursor.execute(query, args)\\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/MySQLdb/cursors.py\\\", line 206, in execute\\n    res = self._query(query)\\n          ^^^^^^^^^^^^^^^^^^\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/MySQLdb/cursors.py\\\", line 319, in _query\\n    db.query(q)\\n  File \\\"/opt/specify7/ve/lib/python3.12/site-packages/MySQLdb/connections.py\\\", line 254, in query\\n    _mysql.connection.query(self, query)\\ndjango.db.utils.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')\\n\"}"
}

After adding some logging in a one-off branch (tree-logging), I found the following when merging :

The merge processed 10 children (ids 360761-360769) in ~280 seconds total before the worker was killed. The key finding:

[TIMER] merge: recursive merge child id=341763 -> matched id=887848 took 280.361 seconds

Per-child cost breakdown

Each child takes ~6-7 seconds to reparent, broken down as:

Step Time What it does
select_for_update 0.001s Lock the child row
open_interval: shift right 1.4-2.1s UPDATE taxon SET nodenumber=nodenumber+1 WHERE nn>198321 — shifts all nodes right to make room
open_interval: expand containing 0.4-0.7s Update parent's highest child node number
move_interval 0.4-0.7s Move the child's interval to the new position
close_interval: reduce containing 0.4-0.5s Update old parent's highest child node number
close_interval: shift left 1.4-1.7s UPDATE taxon SET nodenumber=nodenumber-1 WHERE nn>199024 — shifts all nodes left to close gap
set_fullnames 0.4-0.5s 17-level self-join UPDATE to rebuild full names
validate_tree_numbering included Full table scan to validate tree integrity
Total ~6-7s

The set_fullnames SQL is massive

The SQL query is a 17-level self-join on taxon with 17 joins to taxontreedefitem, building the full name using deeply nested concat_ws and if expressions. It runs once per child.

Root cause: O(N²) complexity

For a merge with N children, the merge does:

  • 5 × N bulk UPDATEs on the taxon table (open_interval ×2, move_interval, close_interval ×2)
  • N set_fullnames queries (each a 17-level self-join)
  • N validate_tree_numbering queries (full table scan)

With 10 children taking 280 seconds, merging a node with 100 children would take ~2800 seconds (~47 minutes).

The actual timeout

The worker was killed by gunicorn's SIGKILL with message Perhaps out of memory? — the merge was consuming too much memory, likely from holding all the select_for_update locks and the large result sets from the 17-level self-joins.

Immediate workarounds

  1. Increase gunicorn timeout — already -t 300 in Dockerfile, but in this case the worker was killed by OOM, not timeout
  2. Increase MySQL innodb_lock_wait_timeoutSET GLOBAL innodb_lock_wait_timeout = 120;
  3. Increase nginx proxy_read_timeout — maybe add proxy_read_timeout 300s; to nginx config?

Long-term fix needed

The merge algorithm needs to be refactored to:

  1. Batch all child reparenting — change all children's parentid in a single UPDATE instead of one at a time
  2. Run interval operations once — shift intervals once for the batch, not per child
  3. Run set_fullnames once — update all affected nodes in a single query
  4. Run validation once — at the end, not after each child

This would reduce the complexity from O(N) to O(1) for the bulk operations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    2 - TreesIssues that are related to the tree system and related functionalities.4 - PerformanceIssues related to performance, concurrency, and optimization

    Type

    No fields configured for Bug.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions