eddyashton
Member

Surprised to discover that this if (NOT TSAN) block gates so many tests. Believe many should now work - let's see what the CI says.

@eddyashton added the run-long-test label (Run Long Test job) on Nov 7, 2024
@eddyashton
Member Author

The failures are so verbose we need to look at the raw logs, but on the first run these are the failing tests:

2024-11-07T15:18:36.9778496Z The following tests FAILED:
2024-11-07T15:18:36.9779075Z 	 40 - recovery_test_cft_api_0 (Failed)
2024-11-07T15:18:36.9779535Z 	 41 - recovery_test_cft_api_1 (Failed)
2024-11-07T15:18:36.9779971Z 	 42 - recovery_test_suite (Failed)
2024-11-07T15:18:36.9780411Z 	 43 - reconfiguration_test_suite (Failed)
2024-11-07T15:18:36.9780881Z 	 44 - regression_test_suite (Failed)
2024-11-07T15:18:36.9781299Z 	 45 - full_test_suite (Failed)
2024-11-07T15:18:36.9781683Z 	 47 - commit_latency (Failed)
2024-11-07T15:18:36.9782045Z 	 50 - auth (Failed)
2024-11-07T15:18:36.9782386Z 	 52 - governance_test (Failed)
2024-11-07T15:18:36.9782758Z 	 53 - jwt_test (Failed)
2024-11-07T15:18:36.9783289Z 	 55 - e2e_logging_cft (Failed)
2024-11-07T15:18:36.9783689Z 	 59 - e2e_logging_http2 (Failed)
2024-11-07T15:18:36.9784172Z 	 60 - membership_api_0 (Failed)
2024-11-07T15:18:36.9784565Z 	 66 - lts_compatibility (Failed)
2024-11-07T15:18:36.9784948Z 	 70 - acme_endorsement_test (Failed)

I've got stacks for some missing mutexes and mutex inversions around the snapshotter, which likely account for the recovery test failures. Will investigate the others.
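For context, a mutex inversion of the kind mentioned above arises when two code paths take the same pair of mutexes in opposite orders. A minimal C++ sketch follows; every name in it (`lock_a`, `lock_b`, `path_one`, `path_two_inverted`, `path_two_fixed`, `run_paths_once`) is hypothetical and not taken from the CCF sources, and the fix shown is one common option rather than what this PR necessarily does:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical mutexes and shared state, for illustration only.
std::mutex lock_a;
std::mutex lock_b;
int shared_counter = 0;

// Path 1 acquires lock_a first, then lock_b.
void path_one()
{
  std::lock_guard<std::mutex> guard_a(lock_a);
  std::lock_guard<std::mutex> guard_b(lock_b);
  ++shared_counter;
}

// Path 2, inverted: acquires lock_b first, then lock_a. If this runs
// concurrently with path_one, each thread can end up holding one mutex
// while waiting for the other. It is shown for contrast and never called.
void path_two_inverted()
{
  std::lock_guard<std::mutex> guard_b(lock_b);
  std::lock_guard<std::mutex> guard_a(lock_a);
  ++shared_counter;
}

// Path 2, fixed: std::lock acquires both mutexes using a deadlock-avoidance
// algorithm, and the guards adopt the already-held locks for release.
void path_two_fixed()
{
  std::lock(lock_a, lock_b);
  std::lock_guard<std::mutex> guard_a(lock_a, std::adopt_lock);
  std::lock_guard<std::mutex> guard_b(lock_b, std::adopt_lock);
  ++shared_counter;
}

// Calls the two safe paths once on the current thread and returns the counter.
int run_paths_once()
{
  shared_counter = 0;
  path_one();
  path_two_fixed();
  assert(shared_counter == 2);
  return shared_counter;
}
```

With C++17 available, `std::scoped_lock` over both mutexes is a shorter way to write `path_two_fixed`. Another option is to agree on a single global acquisition order and follow it on every path.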

@eddyashton
Member Author

First change knocks out a few of those failures already:

2024-11-07T16:20:07.7189192Z 	 40 - recovery_test_cft_api_0 (Failed)
2024-11-07T16:20:07.7189554Z 	 41 - recovery_test_cft_api_1 (Failed)
2024-11-07T16:20:07.7189916Z 	 44 - regression_test_suite (Failed)
2024-11-07T16:20:07.7190245Z 	 45 - full_test_suite (Failed)
2024-11-07T16:20:07.7190561Z 	 52 - governance_test (Failed)
2024-11-07T16:20:07.7190874Z 	 55 - e2e_logging_cft (Failed)
2024-11-07T16:20:07.7191191Z 	 59 - e2e_logging_http2 (Failed)
2024-11-07T16:20:07.7191519Z 	 61 - membership_api_1 (Failed)
2024-11-07T16:20:07.7191848Z 	 66 - lts_compatibility (Failed)
2024-11-07T16:20:07.7192171Z 	 70 - acme_endorsement_test (Failed)

acme_endorsement_test is unrelated: pebble isn't installed.

Worryingly, we may be missing some TSAN information from the unit tests - the reports are either muzzled by the test wrapper, or are non-fatal warnings that don't fail the test:

$ TSAN_OPTIONS=second_deadlock_stack=1  ./snapshot_test 
[doctest] doctest version is "2.4.11"
[doctest] run with "--help" for options
==================
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=360032)
  Cycle in lock order graph: M0 (0x7b4400000be8) => M1 (0x7fff03361dc8) => M0

  Mutex M1 acquired here while holding mutex M0 in main thread:
    #0 pthread_mutex_lock <null> (snapshot_test+0x83a0a) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37)
    #1 std::__1::mutex::lock() <null> (libc++.so.1+0x4af15) (BuildId: e3dee72a81fed73680e4d05b6858c5327d95f499)
    #2 ccf::kv::Store::get_map(unsigned long, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) /data/src/2.CCF/build.san/../src/kv/store.h:238:40 (snapshot_test+0x1c04cc) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37)
...

SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) (/data/src/2.CCF/build.san/snapshot_test+0x83a0a) (BuildId: 6ef6b264fe1f8e764247d52773f4c39cfad93b37) in pthread_mutex_lock
==================
===============================================================================
[doctest] test cases:  1 |  1 passed | 0 failed | 0 skipped
[doctest] assertions: 13 | 13 passed | 0 failed |
[doctest] Status: SUCCESS!
ThreadSanitizer: reported 1 warnings
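
If the concern is that reports like this one scroll past without failing the test, one option is to bake stricter defaults into the test binary through the `__tsan_default_options` hook, which the TSan runtime consults at startup. This is only a sketch, assuming the binary is built with `-fsanitize=thread`; nothing in this PR says CCF does it this way. `halt_on_error` is a standard sanitizer flag that stops the process at the first report, and `default_options_contain` is a hypothetical helper added only so the setting can be checked:

```cpp
#include <cassert>
#include <cstring>

// Hook looked up by the ThreadSanitizer runtime at startup. In a build
// without TSan it is just an ordinary, unused function.
extern "C" const char* __tsan_default_options()
{
  // halt_on_error=1: exit on the first report instead of continuing.
  // second_deadlock_stack=1: print both stacks for lock-order inversions.
  return "halt_on_error=1 second_deadlock_stack=1";
}

// Hypothetical helper: reports whether a flag string appears in the defaults.
bool default_options_contain(const char* flag)
{
  assert(flag != nullptr);
  return std::strstr(__tsan_default_options(), flag) != nullptr;
}
```

Values passed in the `TSAN_OPTIONS` environment variable should still be applied on top of these defaults, so an invocation like the one above would keep working.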

@eddyashton
Member Author

Closing this PR, superseded by @maxtropets' work in #7201/#7232.

@eddyashton closed this on Sep 2, 2025