Skip to content

RemoteLogDownloader deadlock when met exception. #1751

@loserwang1024

Description

@loserwang1024

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.7.0 (latest release)

Please describe the bug 🐞

In the original code, the download lock (prefetchSemaphore) is released only in two cases:

  1. After a RemoteLogSegment has been successfully read (drained), the lock is released via recycleRemoteLog.
Image
  1. When the download of a file fails, the lock is released.
Image

Let us simplify the model: suppose a bucket contains three segment files — A, B, and C — and client.scanner.remote-log.prefetch-num = 1.

  1. File A fails to download, so the lock is released. File A is then added back to the end of the queue.
  2. File B downloads successfully, but the lock is not immediately released because it hasn't been drained.
    Since file A has an earlier offset, it remains at the front of the queue and must be processed first. However, file B holds the prefetch lock, and file A cannot be reattempted until the lock is acquired again. But because B will never be drained (as A blocks its processing), the lock is never released — resulting in a deadlock.
  3. file C will not be downloaded, and file A will never be retried. The entire job becomes stuck.

Solution

When RemoteLogDownloader failed to download a file, no longer release the semaphore but retied to download for several times. If still failed to download, but thrown the exception out of client, let the flink job fails and restarts.

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions