Refactor FilesExt: extract presigned URL coordination into builder#1295
Refactor FilesExt: extract presigned URL coordination into builder#1295parthban-db merged 4 commits intomainfrom
Conversation
…anagement (#1294) ## 🥞 Stacked PR Use this [link](https://github.com/databricks/databricks-sdk-py/pull/1294/files) to review incremental changes. - [**stack/refactor-files-client-2**](#1294) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1294/files)] - [stack/refactor-files-client-3](#1295) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1295/files/a2305e5743d3d08230f26e08932969f81dff4185..f9e1b4f9bb0829e87e450753544d236e928c4a05)] --------- <!-- This template provides a recommended structure for PR descriptions. Adapt it freely — the goal is clarity, not rigid compliance. The three-section format (Summary / Why / What Changed) helps reviewers understand the change quickly and makes the PR easier to revisit later. --> ## Why Today the FilesExt class builds API URLs using hardcoded relative paths (e.g. "/api/2.0/fs/create-upload-part-urls") and creates a new unauthenticated requests.Session on every upload/download operation. This works, but has two limitations: - No single point of control for request routing. When we add storage-proxy support, uploads need to be routed to a different hostname. With paths scattered across a dozen methods, everyone needs to change. A central _get_hostname() makes the future change a one-liner. - Unnecessary session churn. A cloud_provider_session was created by a high-level method and threaded through 5+ layers of calls as a parameter. Some methods received it only to pass it further down. This made signatures noisy, and the data flow hard to follow. ## What changed ### Interface changes None. All changes are to private methods. ### Behavioral changes <!-- User-visible behavior changes: different defaults, changed error messages, new side effects, performance characteristics. Write "None." if this is a pure refactor — this explicitly reassures reviewers. --> None. This is a pure refactor — the cloud provider session is now cached and reused instead of re-created per operation, but requests.Session is safe to share across calls. ### Internal changes <!-- Refactoring, file moves, implementation details, test infrastructure. Things that don't affect the public API or user-visible behavior. --> - _get_hostname() — new method returning the workspace hostname. All self._api.do() calls now use url=f"{hostname}/api/..." instead of relative paths. This is the future extension point for storage-proxy routing. - _cloud_provider_session() — renamed from _create_cloud_provider_session(). Caches the session on first call. Callers invoke self._cloud_provider_session() directly instead of storing a local variable. ## How is this tested? <!-- Describe any tests you have done, especially tests that are not part of the unit tests (e.g. local tests, integration tests, manual verification). ALWAYS ANSWER THIS QUESTION: answer with "N/A" if tests are not applicable to your PR (e.g. if the PR only modifies comments). Do not be afraid of answering "Not tested" if the PR has not been tested. Being clear about what has been done and not done provides important context to the reviewers. --> - All existing unit tests. - One new test case was added ("Create resumable URL: node exists but URL is empty") to cover the build_resumable_upload_url validation path, where the response node exists but contains no URL. NO_CHANGELOG=true
f9e1b4f to
245d575
Compare
245d575 to
9e504b1
Compare
Divyansh-db
left a comment
There was a problem hiding this comment.
LGTM, left some comments
|
|
||
| url = upload_part_url["url"] | ||
| required_headers = upload_part_url.get("headers", []) | ||
| assert current_part_number == upload_part_url["part_number"] |
There was a problem hiding this comment.
can we keep current_part_number as part of _PresignedUrl to validate this assertion?
databricks/sdk/mixins/files.py
Outdated
| "POST", url=f"{hostname}/api/2.0/fs/create-upload-part-urls", headers=headers, body=body | ||
| ) | ||
| presigned = builder.build_upload_part_urls( | ||
| ctx.target_path, session_token, part_index, 1, self._get_upload_url_expire_time() |
There was a problem hiding this comment.
| ctx.target_path, session_token, part_index, 1, self._get_upload_url_expire_time() | |
| ctx.target_path, session_token, part_index, 1, self._get_upload_url_expire_time(), |
nit, The parameters of a function should be linted in a uniform format (as https://github.com/databricks/databricks-sdk-py/pull/1295/changes#diff-469b96ba8e49ef7cbe8d3b4577d2a0ddf3a3552638b76d87bfccb936fda65d18R1849-R1854)
databricks/sdk/mixins/files.py
Outdated
| url = url_node.get("url") | ||
| if not url: | ||
| raise ValueError(f"Unexpected server response: {response}") | ||
| headers: dict[str, str] = {"Content-Type": "application/octet-stream"} |
There was a problem hiding this comment.
This was not set in the existing code
Range-diff: main (27164f7 -> 18f57c6)
Reproduce locally: |
|
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success. |
🥞 Stacked PR
Use this link to review incremental changes.
Summary
Extracts the presigned URL coordination logic (
create-upload-part-urlsandcreate-resumable-upload-urlAPI calls and response parsing) into a dedicated_PresignedUrlRequestBuilderclass, replacing inline code in three upload methods.Why
The
create-upload-part-urlsAPI call, response validation, and header parsing was duplicated across_do_upload_one_part,_perform_multipart_upload, and_perform_resumable_upload(with a similar pattern forcreate-resumable-upload-url). Each copy built the request body, called_api.do(), validated the response structure, and converted the headers list to a dict — all inline. This made the upload methods longer than necessary and meant any change to the coordination logic required updating multiple places._PresignedUrlRequestBuilderconsolidates this into a single class. It also prepares for storage-proxy routing (#1278), where a different builder implementation can construct URLs directly instead of calling the presigned URL APIs.What changed
Interface changes
None. All changes are to private methods.
Behavioral changes
ValueError/KeyErrorfrom response parsing) now trigger fallback to single-shot upload on the first part, instead of propagating as hard errors. Previously the_api.do()call was wrapped intry/exceptbut the response parsing was outside it. Now both are encapsulated in the builder, so parsing errors are also caught. This is more resilient — a broken coordination response should not abort the upload when a simpler path exists.Internal changes
_PresignedUrl— new dataclass holding a resolved presigned URL and its associated headers._PresignedUrlRequestBuilder— new class withbuild_upload_part_urls()andbuild_resumable_upload_url(). Encapsulates the coordination API calls and response parsing._do_upload_one_part— replaced inlinecreate-upload-part-urlscall and response parsing withbuilder.build_upload_part_urls(..., count=1)[0]._perform_multipart_upload— replaced inlinecreate-upload-part-urlsbatch call and response parsing withbuilder.build_upload_part_urls(..., count=batch_size)._perform_resumable_upload— replaced inlinecreate-resumable-upload-urlcall and response parsing withbuilder.build_resumable_upload_url().How is this tested?
Unit tests.
NO_CHANGELOG=true