Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.7.0 (latest release)
Please describe the bug 🐞
In the original code, the download lock (prefetchSemaphore) is released only in two cases:
- After a RemoteLogSegment has been successfully read (drained), the lock is released via recycleRemoteLog.

- When the download of a file fails, the lock is released.

Let us simplify the model: suppose a bucket contains three segment files — A, B, and C — and client.scanner.remote-log.prefetch-num = 1.
- File A fails to download, so the lock is released. File A is then added back to the end of the queue.
- File B downloads successfully, but the lock is not immediately released because it hasn't been drained.
Because file A has the earlier offset, it is still the next segment to be read and must be processed before file B. Retrying file A requires acquiring the prefetch lock again, but file B holds it, and file B only releases it after being drained. File B can never be drained because file A blocks its processing, so the lock is never released and the client deadlocks. As a result, file C is never downloaded, file A is never retried, and the entire job is stuck.
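The sequence above can be replayed with a plain `java.util.concurrent.Semaphore`. This is only a minimal single-threaded model of the scenario, not the Fluss client code: the class name, the queue, and the method `canStartAnotherDownload` are hypothetical, and `tryAcquire` is used at the end so the sketch reports the stuck state instead of actually hanging.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

// Hypothetical model of the A/B/C scenario; not the actual Fluss implementation.
class PrefetchDeadlockSketch {

    /** Replays the sequence and reports whether any further download could start. */
    static boolean canStartAnotherDownload() {
        // One permit models client.scanner.remote-log.prefetch-num = 1.
        Semaphore prefetchSemaphore = new Semaphore(1);
        Deque<String> downloadQueue = new ArrayDeque<>();
        downloadQueue.add("A");
        downloadQueue.add("B");
        downloadQueue.add("C");

        // File A: take the permit, the download fails, the permit is released
        // and A goes back to the end of the queue.
        prefetchSemaphore.acquireUninterruptibly();
        String failed = downloadQueue.poll();
        prefetchSemaphore.release();
        downloadQueue.add(failed);

        // File B: take the permit, the download succeeds. The permit is only
        // released once B is drained, which cannot happen before A is read.
        prefetchSemaphore.acquireUninterruptibly();
        downloadQueue.poll();

        // C and A are still queued, but no permit is left for either of them.
        return prefetchSemaphore.tryAcquire();
    }

    public static void main(String[] args) {
        System.out.println("another download can start: " + canStartAnotherDownload());
    }
}
```

In this model the final `tryAcquire` fails, which corresponds to the download thread blocking forever in the real client.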
Solution
When RemoteLogDownloader fails to download a file, it should no longer release the semaphore but instead retry the download several times. If the download still fails, it should throw the exception out of the client so that the Flink job fails and restarts.
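A rough sketch of the proposed behavior, under stated assumptions: the class `RetryingDownloadSketch`, the `Download` interface, and the `maxAttempts` parameter are hypothetical names, not part of Fluss. Releasing the permit just before the final exception is also an assumption of this sketch; the proposal only requires that the permit is not released between attempts.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of "retry while holding the permit"; not the actual fix.
class RetryingDownloadSketch {

    /** A single download attempt that may fail. */
    interface Download {
        void run() throws Exception;
    }

    static void downloadWithRetry(Semaphore prefetchSemaphore, Download download, int maxAttempts) {
        prefetchSemaphore.acquireUninterruptibly();
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                download.run();
                // Success: the permit stays held until the segment is drained
                // and recycled elsewhere.
                return;
            } catch (Exception e) {
                // Keep the permit and try again, so a later segment cannot
                // take it while this earlier segment is still missing.
                lastFailure = e;
            }
        }
        // Assumption: release before giving up, then surface the error so the
        // caller (for example the Flink job) fails and restarts.
        prefetchSemaphore.release();
        throw new IllegalStateException(
                "download failed after " + maxAttempts + " attempts", lastFailure);
    }
}
```

Because the failing segment keeps the permit across attempts, file B in the example above cannot acquire the lock ahead of file A, which removes the ordering conflict that caused the deadlock.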
Are you willing to submit a PR?
- I'm willing to submit a PR!