
fix: Resolve JSONDecodeError in LLM fine-tuning tune() API#2610

Open
Priyanshu-u07 wants to merge 3 commits into kubeflow:master from Priyanshu-u07:fix-llm-training-parameters-json

Conversation

@Priyanshu-u07

Description

This PR fixes a bug in the Katib Python SDK where LLM training parameters were passed as an improperly quoted JSON string when using the tune() API. The extra shell quoting caused the LLM worker pod (PyTorch container) to fail with a JSONDecodeError, preventing fine-tuning experiments from running correctly.

The implementation ensures that:

  • training_parameters and lora_config are serialized as valid JSON without extra quotes before being passed to the container.
  • A validation check is added to ensure trainer_parameters.training_parameters is not empty, raising a helpful error if missing.
  • Unit tests are added to verify both the JSON serialization and the validation behavior.

This resolves the LLM fine-tuning errors in the worker pod, allowing experiments using the tune() API to initialize and run successfully.
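A small sketch of the failure mode, assuming the pre-fix code wrapped the serialized dict in an extra pair of shell quotes before handing it to the container (the variable names here are illustrative, not taken from the SDK):

```python
import json

params = {"learning_rate": 1e-4, "num_train_epochs": 3}

# Pre-fix behavior (simplified): an extra layer of quoting around the JSON.
badly_quoted = f"'{json.dumps(params)}'"

# The PyTorch worker tries to parse the argument back and fails:
try:
    json.loads(badly_quoted)
    decode_failed = False
except json.JSONDecodeError:
    decode_failed = True  # the leading "'" is not valid JSON

# Without the extra quotes, the round trip succeeds:
restored = json.loads(json.dumps(params))
```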

Changes Included

sdk/python/v1beta1/kubeflow/katib/api/katib_client.py:

  • Fixed JSON serialization for training_parameters and lora_config container args by removing extra shell quoting.
  • Added type-safe json.dumps() conversion to ensure Kubernetes args are always strings.
  • Added early validation for missing training_parameters with a user-friendly error message linking to documentation.
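A minimal sketch of the change described above (the helper name and flag names are hypothetical; the real logic lives in `katib_client.py`):

```python
import json

def build_trainer_args(training_parameters, lora_config):
    # Hypothetical helper mirroring the katib_client.py change.
    # Early validation: fail fast with a helpful message.
    if not training_parameters:
        raise ValueError(
            "trainer_parameters.training_parameters must not be empty; "
            "see the Katib LLM fine-tuning documentation."
        )
    # json.dumps() already produces a valid JSON string; wrapping it in
    # extra shell quotes (the old behavior) breaks json.loads() in the pod.
    return [
        "--training_parameters",
        json.dumps(training_parameters),
        "--lora_config",
        json.dumps(lora_config),
    ]
```

Because `json.dumps()` always returns a `str`, this also keeps the Kubernetes container args type-safe without any manual `str()` casting.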

sdk/python/v1beta1/kubeflow/katib/api/katib_client_test.py:

  • Added a test case for missing training_parameters.
  • Added a test case to verify correct JSON serialization format in container arguments.
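The two test cases could look roughly like this (names and the stand-in helper are illustrative; the real tests live in `katib_client_test.py` and exercise the SDK directly):

```python
import json
import unittest

def serialize_training_parameters(training_parameters):
    # Stand-in for the SDK helper under test.
    if not training_parameters:
        raise ValueError("training_parameters must not be empty")
    return json.dumps(training_parameters)

class TrainerArgsTest(unittest.TestCase):
    def test_missing_training_parameters_raises(self):
        # Empty parameters should produce a clear error, not a pod crash.
        with self.assertRaises(ValueError):
            serialize_training_parameters({})

    def test_container_arg_is_valid_json(self):
        arg = serialize_training_parameters({"num_train_epochs": 3})
        # No surrounding shell quotes; the value must round-trip cleanly.
        self.assertFalse(arg.startswith("'"))
        self.assertEqual(json.loads(arg), {"num_train_epochs": 3})
```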

Testing:

  • All 25 existing tests pass.
  • New unit tests verify that the bug is fixed and the container receives correctly formatted JSON.

Fixes #2587

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Katib repo! 🎉

Thanks for opening your first PR! We're excited to have you onboard 🚀

Next steps:

Feel free to ask questions in the comments. Thanks again for contributing! 🙏

Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@Priyanshu-u07 Priyanshu-u07 force-pushed the fix-llm-training-parameters-json branch from cefc33f to 39cdc41 Compare January 29, 2026 21:29
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Jan 29, 2026
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>

Development

Successfully merging this pull request may close these issues.

LLM fine-tuning errors in LLM worker pod with PyTorch container
