docs: clarify Parameters for the add_files API #2249

Open. Wants to merge 2 commits into base: main

Xiezhibin

Summary

Related Issue: #2132

This PR enhances the documentation for the add_files API by:

  1. Adding a parameter table to clarify the required and optional inputs and outputs.
  2. Providing a complete example that includes all parameters, such as snapshot_properties and check_duplicate_files.
  3. Strengthening the warning regarding the default setting of check_duplicate_files=True and the associated risks of disabling it.
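
As a rough illustration of the kind of complete call the summary describes, the sketch below passes all three parameters. The parameter names `file_paths`, `snapshot_properties`, and `check_duplicate_files` are the ones mentioned in this conversation. The `register_files` helper, the `StubTable` class, the file path, and the property key are hypothetical: the stub stands in for a real table loaded from a catalog so the example runs on its own.

```python
from typing import Dict, List, Optional


def register_files(
    table,
    file_paths: List[str],
    snapshot_properties: Optional[Dict[str, str]] = None,
    check_duplicate_files: bool = True,
) -> None:
    """Hypothetical wrapper that forwards every parameter to table.add_files."""
    table.add_files(
        file_paths=file_paths,
        snapshot_properties=snapshot_properties or {},
        # Left at True unless the caller explicitly opts out.
        check_duplicate_files=check_duplicate_files,
    )


class StubTable:
    """Stand-in for a table object; it only records the arguments it receives."""

    def __init__(self) -> None:
        self.calls = []

    def add_files(self, file_paths, snapshot_properties, check_duplicate_files):
        self.calls.append(
            (list(file_paths), dict(snapshot_properties), check_duplicate_files)
        )


stub = StubTable()
register_files(
    stub,
    file_paths=["s3://warehouse/default/existing-1.parquet"],  # hypothetical path
    snapshot_properties={"added-by": "backfill-job"},  # hypothetical property
)
```

With a real table, the same keyword arguments would be passed to `add_files` directly; the wrapper only exists to make the default for `check_duplicate_files` explicit.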

<!-- prettier-ignore-start -->

!!! note "Name Mapping"
Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg's schema, it requires the Iceberg's table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (The Name mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet file's metadata, and creates a new Name Mapping based on the table's current schema if the table doesn't already have one.
Contributor

All the paragraphs starting with `!!!` are missing their 4 spaces of indentation, unlike how they were written initially.

The spaces are needed to render these boxes correctly.
[Screenshot 2025-07-28 at 10 01 46 PM]

Could you add them back?

Author

Thank you for your reply. I will submit this!
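
To make the Name Mapping note quoted in this thread more concrete, here is a minimal sketch of how a name mapping can resolve Parquet column names to Iceberg field IDs. The list layout with `field-id` and `names` keys follows the name mapping serialization in the Iceberg spec. The `resolve_field_id` helper and the sample column names are hypothetical, nested fields are ignored, and this is not how PyIceberg itself implements the lookup.

```python
from typing import Dict, List, Optional

# A name mapping in its serialized layout; the column names are made up.
NAME_MAPPING: List[Dict] = [
    {"field-id": 1, "names": ["id", "record_id"]},
    {"field-id": 2, "names": ["data"]},
]


def resolve_field_id(mapping: List[Dict], column_name: str) -> Optional[int]:
    """Return the field ID whose names include column_name, or None if unmapped.

    Only top-level entries are searched in this sketch.
    """
    for entry in mapping:
        if column_name in entry["names"]:
            return entry["field-id"]
    return None


print(resolve_field_id(NAME_MAPPING, "record_id"))
print(resolve_field_id(NAME_MAPPING, "unknown_column"))
```

Because the lookup is by name, a Parquet file written without field IDs can still be read against the table schema, which is the situation the note describes.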

Because `add_files` commits the existing parquet files to the Iceberg Table as any other data file, destructive maintenance operations like expiring snapshots will remove them.

!!! warning "Check Duplicate Files"
The `check_duplicate_files` parameter is `True` by default and will check the new files against the existing Iceberg table data files to prevent duplicates. This check can be expensive for large tables with many files. It is recommended to use the default configuration. The check can be turned off by setting `check_duplicate_files=False`, but this may result in duplicate files being added to the table, which can lead to data consistency issues and potential table corruption if the same data file is added multiple times.
Contributor

Suggested change
- The `check_duplicate_files` parameter is `True` by default and will check the new files against the existing Iceberg table data files to prevent duplicates. This check can be expensive for large tables with many files. It is recommended to use the default configuration. The check can be turned off by setting `check_duplicate_files=False`, but this may result in duplicate files being added to the table, which can lead to data consistency issues and potential table corruption if the same data file is added multiple times.
+ The `check_duplicate_files` parameter controls whether the method checks if any of the provided `file_paths` are already present in the Iceberg table. By default, it is set to `True`, which performs a validation against the table’s current data files to prevent accidental duplication.
+ This check helps maintain data consistency by ensuring that the same data file is not added multiple times. However, for tables with a large number of files, this validation can be expensive in terms of performance.
+ To skip the duplicate check, set `check_duplicate_files=False`. This can improve performance but increases the risk of introducing duplicate files, which may lead to data inconsistency or table corruption if the same file is added more than once.

I used an LLM to generate this based on the function definition. WDYT?

Author

Thank you for your guidance; it makes a lot of sense. I think it would also help to give users a clear recommendation. For example, the docs could suggest keeping `check_duplicate_files` at its default of `True` to ensure data consistency and avoid accidental duplication. Users who hit performance bottlenecks in the validation step could then consider setting it to `False`.
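
The trade-off discussed in this thread can be sketched in a few lines. This is a simplified model under stated assumptions, not PyIceberg's implementation: `add_files_sketch` is a hypothetical function, the table is represented as a plain list of data file paths, and the paths are made up.

```python
from typing import List


def add_files_sketch(
    existing: List[str],
    file_paths: List[str],
    check_duplicate_files: bool = True,
) -> List[str]:
    """Return the data file list after registering file_paths.

    With the check enabled, any path already in the table is rejected.
    The check compares against every existing file, which is why it
    gets more expensive as the table accumulates files.
    """
    if check_duplicate_files:
        known = set(existing)
        duplicates = [path for path in file_paths if path in known]
        if duplicates:
            raise ValueError(f"Files already referenced by table: {duplicates}")
    # Without the check, a repeated path is simply registered again.
    return existing + file_paths


table_files = ["s3://warehouse/t/a.parquet"]  # hypothetical existing file

# Default behaviour: the duplicate is caught before anything is added.
try:
    add_files_sketch(table_files, ["s3://warehouse/t/a.parquet"])
    rejected = False
except ValueError:
    rejected = True

# Check disabled: the same file ends up in the table twice.
unchecked = add_files_sketch(
    table_files, ["s3://warehouse/t/a.parquet"], check_duplicate_files=False
)
```

In this model, disabling the check leaves the same path listed twice, which is the data-consistency risk the warning is meant to call out.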

2 participants