@khandelwal-prateek
Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

  • Gobblin's current IcebergDatasetFinder is tightly coupled to Iceberg→Iceberg copying and cannot output raw files or copy to arbitrary sinks like Azure.

  • This change creates a decoupled IcebergSource to enable copying Iceberg files to any destination.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@khandelwal-prateek marked this pull request as draft on October 16, 2025 10:12
Member

Can we add the following improvements here, similar to what CopySource does?

  1. Add support for simulate mode.
  2. Set ServiceConfigKeys.WORK_UNIT_SIZE on each WorkUnit so that dynamic scaling can be driven by data size rather than by file count (i.e. work unit count).
  3. Introduce bin packing for work units.
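
To make the size-based suggestions concrete, here is a rough, self-contained sketch of first-fit-decreasing bin packing over file sizes. It deliberately avoids Gobblin classes: `WorkUnitBinPacker`, `FileUnit`, `Bin`, `pack`, and `describePlan` are hypothetical names for illustration only, and this is not necessarily how CopySource groups its work units. In a real source, the per-file size would presumably be the value recorded under ServiceConfigKeys.WORK_UNIT_SIZE.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Gobblin.
class WorkUnitBinPacker {

    /** Stand-in for a single-file work unit: a path plus its size in bytes. */
    static final class FileUnit {
        final String path;
        final long sizeBytes;

        FileUnit(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /** A group of files that would be processed together. */
    static final class Bin {
        final List<FileUnit> files = new ArrayList<>();
        long totalBytes = 0L;

        void add(FileUnit unit) {
            files.add(unit);
            totalBytes += unit.sizeBytes;
        }
    }

    /**
     * First-fit decreasing: sort by size (largest first), then place each
     * file into the first bin that still has room. A file larger than
     * maxBytesPerBin ends up alone in its own bin.
     */
    static List<Bin> pack(List<FileUnit> units, long maxBytesPerBin) {
        if (maxBytesPerBin <= 0) {
            throw new IllegalArgumentException("maxBytesPerBin must be positive");
        }
        List<FileUnit> sorted = new ArrayList<>(units);
        sorted.sort(Comparator.comparingLong((FileUnit u) -> u.sizeBytes).reversed());

        List<Bin> bins = new ArrayList<>();
        for (FileUnit unit : sorted) {
            Bin target = null;
            for (Bin bin : bins) {
                if (bin.totalBytes + unit.sizeBytes <= maxBytesPerBin) {
                    target = bin;
                    break;
                }
            }
            if (target == null) {
                target = new Bin();
                bins.add(target);
            }
            target.add(unit);
        }
        return bins;
    }

    /** What a simulate mode could log instead of copying anything. */
    static List<String> describePlan(List<Bin> bins) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < bins.size(); i++) {
            Bin bin = bins.get(i);
            lines.add(String.format("bin %d: %d file(s), %d bytes",
                    i, bin.files.size(), bin.totalBytes));
        }
        return lines;
    }
}
```

With sizes recorded per work unit, scaling decisions can key off `totalBytes` per bin rather than the number of files, and a simulate run can print the `describePlan` output without touching the destination.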
