Fix race condition in FontFile data access during threaded imports#107224
Fix race condition in FontFile data access during threaded imports#107224d6e wants to merge 3 commits intogodotengine:masterfrom
Conversation
Use call_deferred() instead of direct call() in ResourceLoader::resource_changed_emit() to ensure resource change callbacks execute on the main thread. This prevents segfaults when font imports trigger node operations from background threads.
047e113 to
50211c8
Compare
|
Had to force push an update because I forgot to run the formatter |
6bb7cc3 to
50211c8
Compare
|
Added bounds checking but then decided against it because if the array sizes mismatch that would indicate a deeper issue |
|
Do you happen to have an MRP so that PR reviewers can easily test your code and confirm that it works? |
|
Also should we close #107218 now since this PR fully supersedes it? |
I don't have one, it's an issue that occurs in our very large project and only happens 1 out of approximately 20 times. Do I need one to continue with this? |
Not necessarily, it just significantly increases the speed that this will be merged. Without an MRP it might be challenging for a reviewer to verify that the code is correct and actually works. That being said, I am unfamiliar with this code, another reviewer may be able to read the code and determine that it is obviously correct |
|
Okay, I think this MRP should do it. I can confirm it crashes without the fix and does not crash with the fix |
scene/resources/font.cpp
Outdated
| // Force a unique copy to prevent COW issues | ||
| data.write[0] = data[0]; |
There was a problem hiding this comment.
This is exactly opposite of what is wanted in the absolute majority of cases, the whole point of having PackedByteArray is ensuring the same font data is never copied.
What probably should be done instead is getting red of raw pointer at all and only keep PackedByteArray in both Font and TextServer (this will require adding a way to make PackedByteArray from static memory buffer without make a copy).
There was a problem hiding this comment.
diff --git a/scene/resources/font.cpp b/scene/resources/font.cpp
index 357d22c3e4..63ecd42418 100644
--- a/scene/resources/font.cpp
+++ b/scene/resources/font.cpp
@@ -592,6 +592,8 @@ _FORCE_INLINE_ void FontFile::_ensure_rid(int p_cache_index, int p_make_linked_f
if (p_make_linked_from >= 0 && p_make_linked_from != p_cache_index && p_make_linked_from < cache.size()) {
cache.write[p_cache_index] = TS->create_font_linked_variation(cache[p_make_linked_from]);
} else {
+ MutexLock lock(data_mutex);
+
cache.write[p_cache_index] = TS->create_font();
TS->font_set_data_ptr(cache[p_cache_index], data_ptr, data_size);
TS->font_set_antialiasing(cache[p_cache_index], antialiasing);
@@ -1410,9 +1412,10 @@ void FontFile::_get_property_list(List<PropertyInfo> *p_list) const {
void FontFile::reset_state() {
_clear_cache();
- data.clear();
+ data = PackedByteArray();
data_ptr = nullptr;
data_size = 0;
+ data_external = false;
cache.clear();
antialiasing = TextServer::FONT_ANTIALIASING_GRAY;
@@ -2075,9 +2078,10 @@ Error FontFile::load_dynamic_font(const String &p_path) {
void FontFile::set_data_ptr(const uint8_t *p_data, size_t p_size) {
MutexLock lock(data_mutex);
- data.clear();
+ data = PackedByteArray();
data_ptr = p_data;
data_size = p_size;
+ data_external = true;
for (int i = 0; i < cache.size(); i++) {
if (cache[i].is_valid()) {
@@ -2091,16 +2095,9 @@ void FontFile::set_data(const PackedByteArray &p_data) {
// Make a copy to ensure data stability before getting the pointer
data = p_data;
- // Ensure data is not empty before getting pointer
- if (data.size() > 0) {
- // Force a unique copy to prevent COW issues
- data.write[0] = data[0];
- data_ptr = data.ptr();
- data_size = data.size();
- } else {
- data_ptr = nullptr;
- data_size = 0;
- }
+ data_ptr = data.ptr();
+ data_size = data.size();
+ data_external = false;
for (int i = 0; i < cache.size(); i++) {
if (cache[i].is_valid()) {
@@ -2112,11 +2109,9 @@ void FontFile::set_data(const PackedByteArray &p_data) {
PackedByteArray FontFile::get_data() const {
MutexLock lock(data_mutex);
- if (unlikely((size_t)data.size() != data_size)) {
+ if (unlikely(data_external && data.is_empty() && data_ptr && data_size > 0)) {
data.resize(data_size);
- if (data_ptr && data_size > 0) {
- memcpy(data.ptrw(), data_ptr, data_size);
- }
+ memcpy(data.ptrw(), data_ptr, data_size);
}
return data;
}
diff --git a/scene/resources/font.h b/scene/resources/font.h
index 3dd6b77aa4..95ecf2b78d 100644
--- a/scene/resources/font.h
+++ b/scene/resources/font.h
@@ -189,6 +189,7 @@ class FontFile : public Font {
mutable Mutex data_mutex;
const uint8_t *data_ptr = nullptr;
size_t data_size = 0;
+ bool data_external = false;
mutable PackedByteArray data;
TextServer::FontAntialiasing antialiasing = TextServer::FONT_ANTIALIASING_GRAY;This seems to work as well, and should avoid unneeded COW copies. Changing data in get_data should never happen for loaded fonts, it's only there to support get_data on built-in fonts.
It's also reasonable to use the same mutex when TS font objects are created in _ensure_rid, since the same pointers are accessed.
There was a problem hiding this comment.
Tested locally on Windows 11 24H2 with the MRP linked above, it works as expected.
Note that I had to increase the memory/limits/message_queue/max_size_mb project setting in the MRP to get it to run correctly (I set it to 2048 just to be sure).
=== FontFile Race Condition Reproduction ===
This MRP uses real threading to reproduce the FontFile race condition
that causes crashes during concurrent font data access.
Created FontFile with test data, size: 2048
Starting threaded race condition test...
(This should crash in versions without the FontFile fix)
Stopping threads...
=== Results ===
UNEXPECTED: No crashes detected after 2275500 iterations
This suggests the FontFile race condition fix is already present,
or the race condition wasn't triggered this time.
Expected behavior:
- WITHOUT fix: Segfaults/crashes during execution
- WITH fix: Completes without crashes
However, in master, running the MRP didn't crash either, both with unoptimized and optimized editor builds compiled using MSVC 2022. I'm on a 9950X3D, so maybe I would need to adjust OS.delay_msec() calls to get the race condition to occur (assuming it was tested on a slower CPU).
|
@d6e Which OS and compiler are you using to reproduce the issue (and test the fix)? |
I can't get it working with or without fix, it's just getting stuck and endlessly consuming memory (killed it after going over 40 GB), it's not related to the Tested on macOS 26, for the reference. |
|
Thanks for the suggested implementation! I've applied your data_external approach which avoids the unnecessary COW copies. I also added mutex protection to I don't really have the time to get into the MRP right now, do you want to go with your suggested implementation? |
Linux and gcc 13.3.0 |
Add mutex protection to FontFile data access methods to prevent race conditions when multiple import threads access the same cached FontFile simultaneously. Use data_external flag to track whether data came from set_data_ptr() vs set_data() to avoid unnecessary PackedByteArray COW copies.
b351451 to
adba698
Compare
Co-authored-by: A Thousand Ships <96648715+AThousandShips@users.noreply.github.com>
Problems
FontFile Data Race
FontFilefromResourceCache.set_data(), the other callsget_data()at the same time.Resource Change Callbacks on Background Threads
Import threads were firing resource-change callbacks directly.
That caused node operations off the main thread, triggering errors like:
Fixes
1. FontFile Thread Safety
Files updated:
scene/resources/font.handscene/resources/font.cppWhat changed:
mutable Mutex data_mutexinsideFontFile.set_data_ptr(),set_data(), andget_data()indata_mutexlocks.set_data(), forced a unique COW copy (data.write()[0] = data[0]) so no other thread is still holding an old pointer.get_data()so it won’t try to read invalid data.2. Resource Callback Thread Safety
File updated:
core/io/resource_loader.cppWhat changed:
rcc.callable.call()forrcc.callable.call_deferred()inresource_changed_emit().