
Conversation

kaikaila
Contributor

@kaikaila kaikaila commented Jun 25, 2025

Summary

This PR migrates the Kubeflow Pipelines backend from GORM v1 (github.com/jinzhu/gorm) to GORM v2 (gorm.io/gorm). It covers all ORM models and the execution cache subsystem.

Breaking Changes

  • Stricter Field Length Constraints

    • Certain string fields now have tighter length limits so that they remain indexable on MySQL (e.g., within the index key length limit under utf8mb4). See the complete list of constrained fields in backend/src/apiserver/validation/length.go.
    • API layer enforces these limits: overlong inputs are rejected with HTTP 400 and clear messages.
  • Upgrade Guardrails for Legacy Data

    • During upgrade, a preflight scan aborts migration if existing rows violate the new limits. Users must shorten those values before retrying.
  • (Name, Namespace) Unique Index Deduplication (pipelines)

    • Historically there were two equivalent unique indexes on (Name, Namespace): namespace_name (from tag) and name_namespace_index (manual).
    • We now keep namespace_name only. If both exist, the legacy one is removed (or renamed to namespace_name when safe).
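
As a rough illustration of the API-side length guard described above, here is a minimal sketch. It assumes the 191-character limit (seen in the varchar(191) tag on Name later in this PR) comes from the 767-byte InnoDB index key prefix limit divided by utf8mb4's maximum of 4 bytes per character. The function validateLength and the error text are hypothetical and are not taken from validation/length.go.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	// 767 bytes is the InnoDB index key prefix limit for older row formats.
	innodbKeyPrefixBytes = 767
	// utf8mb4 stores a character in at most 4 bytes.
	utf8mb4MaxBytesPerChar = 4
	// Integer division gives the largest fully indexable length: 191.
	maxIndexedChars = innodbKeyPrefixBytes / utf8mb4MaxBytesPerChar
)

// validateLength is a hypothetical helper: it returns an error when value
// has more than max characters (counted as runes, not bytes).
func validateLength(field, value string, max int) error {
	if n := utf8.RuneCountInString(value); n > max {
		return fmt.Errorf("%s is too long: %d characters, limit is %d", field, n, max)
	}
	return nil
}

func main() {
	fmt.Println(maxIndexedChars)
	fmt.Println(validateLength("Name", "my-pipeline", maxIndexedChars))
}
```

In an API handler, a non-nil error from a check like this would be mapped to an HTTP 400 response.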

Migration/Upgrade Behavior

  • Legacy schema (pre-2.15) detected → run legacy upgrade flow:
    1. Run a preflight check to verify that existing rows comply with the new length limits.
    2. Drop foreign key constraints only (minimal DDL needed to shrink indexed columns).
    3. Targeted legacy index cleanup (MySQL only):
       - Drop single-column indexes on experiments and pipelines (legacy residues). Reference [here](https://github.com/kubeflow/pipelines/blob/cdc85ce90db7b821ad25cfee925b21dc2b22bbbe/backend/src/apiserver/client_manager/client_manager.go#L420C1-L428C3).
       - Drop composite unique index idx_pipeline_version_uuid_name on pipeline_versions (historical). Reference [here](https://github.com/kubeflow/pipelines/blob/cdc85ce90db7b821ad25cfee925b21dc2b22bbbe/backend/src/apiserver/client_manager/client_manager.go#L518).
       - Normalize the (Name, Namespace) unique index on pipelines to namespace_name (keep/rename/drop as needed).
    4. Shrink columns per the new limits; AutoMigrate then re-applies constraints and indexes from tags.
    5. Backfill DisplayName for pipelines / pipeline_versions where needed, and ExperimentUUID in run_details.
  • Non-legacy schema (KFP >=2.15): run autoMigrate for both first-time installs and upgrades between >=2.15 versions.
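
The keep/rename/drop normalization of the (Name, Namespace) index can be sketched as a pure decision function. The index names come from the description above; the function planIndexNormalization, the action names, and the assumption that "safe to rename" means "only the legacy index exists" are illustrative and are not taken from the PR's code.

```go
package main

import "fmt"

type indexAction string

const (
	actionNone   indexAction = "none"          // already normalized, or nothing to clean up
	actionDrop   indexAction = "drop-legacy"   // both exist: the legacy index is redundant
	actionRename indexAction = "rename-legacy" // only legacy exists: rename it to the canonical name
)

const (
	canonicalIndex = "namespace_name"
	legacyIndex    = "name_namespace_index"
)

// planIndexNormalization decides what to do given the set of index names
// currently present on the pipelines table. Running it again after the
// action has been applied yields actionNone, so the cleanup is idempotent.
func planIndexNormalization(existing map[string]bool) indexAction {
	hasCanonical := existing[canonicalIndex]
	hasLegacy := existing[legacyIndex]
	switch {
	case hasCanonical && hasLegacy:
		return actionDrop
	case hasLegacy:
		return actionRename
	default:
		return actionNone
	}
}

func main() {
	both := map[string]bool{canonicalIndex: true, legacyIndex: true}
	fmt.Println(planIndexNormalization(both))
}
```

Keeping the decision separate from the DDL execution makes the no-op case easy to unit test without a database.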

Internal Refactors

  • Migrated to GORM v2 Migrator:

    • Replaced v1 APIs (AddIndex, RemoveIndex, AddForeignKey, AddUniqueIndex, ModifyColumn) with v2 equivalents and struct tags.
    • Index/constraint creation now lives in tags; legacy hand-crafted index DDL in InitDBClient is removed or minimized.
  • InitDBClient flow split: clear separation of the legacy upgrade path and the non-legacy schema path for readability and safety.
  • Unified validation source of truth: the centralized length specs in validation/length.go drive both API guards and DDL shrink to prevent drift.
  • Dialect-specific syntax is abstracted out of InitDBClient into dialect.go.
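
To illustrate the "single source of truth" idea, the sketch below derives an API check, a preflight query, and a shrink statement from one spec entry. The lengthSpec type, its fields, and the helper names are hypothetical; the SQL is MySQL-flavored, and the actual contents of validation/length.go are not shown in this conversation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengthSpec is a hypothetical description of one constrained column.
type lengthSpec struct {
	Table, Column string
	Max           int
	NotNull       bool
}

// fits is the API-side guard: true when value is within the limit.
func (s lengthSpec) fits(value string) bool {
	return utf8.RuneCountInString(value) <= s.Max
}

// preflightQuery counts existing rows that would violate the limit.
func preflightQuery(s lengthSpec) string {
	return fmt.Sprintf("SELECT COUNT(*) FROM `%s` WHERE CHAR_LENGTH(`%s`) > %d",
		s.Table, s.Column, s.Max)
}

// shrinkDDL builds the MySQL statement that narrows the column.
// MODIFY COLUMN restates the full definition, so NOT NULL is repeated.
func shrinkDDL(s lengthSpec) string {
	ddl := fmt.Sprintf("ALTER TABLE `%s` MODIFY COLUMN `%s` VARCHAR(%d)",
		s.Table, s.Column, s.Max)
	if s.NotNull {
		ddl += " NOT NULL"
	}
	return ddl
}

func main() {
	spec := lengthSpec{Table: "pipelines", Column: "Name", Max: 191, NotNull: true}
	fmt.Println(spec.fits("my-pipeline"))
	fmt.Println(preflightQuery(spec))
	fmt.Println(shrinkDDL(spec))
}
```

Because all three outputs read the same Max value, the API limit and the column width cannot drift apart.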

Unit Tests

  • API-level length validation (pass/fail).
  • Preflight length scan (blocks on violations).
  • Idempotent legacy index cleanup (no-op when already normalized).


Hi @kaikaila. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


🚫 This command cannot be processed. Only organization members or owners can use the commands.

@kaikaila
Contributor Author

Hi @HumairAK
I want to bring this to your attention: in backend/src/apiserver/model/task.go, line 26, the field name (RunId) differs from the column name (RunUUID). This may cause confusion when using foreignKey: in GORM v2. Happy to refactor if we want to align them.


@kaikaila: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest



@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch 2 times, most recently from 16600ee to 4287d53 Compare June 25, 2025 23:29
@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch 3 times, most recently from fc90ec9 to fe390b9 Compare July 2, 2025 08:50
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Jul 5, 2025
@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch from 676f98f to 58897c4 Compare July 5, 2025 00:04
@google-oss-prow google-oss-prow bot added size/XL and removed size/XXL labels Jul 5, 2025
  UUID string `gorm:"column:UUID; not null; primaryKey;"`
  CreatedAtInSec int64 `gorm:"column:CreatedAtInSec; not null;"`
- Name string `gorm:"column:Name; not null; unique_index:namespace_name;"` // Index improves performance of the List ang Get queries
+ Name string `gorm:"column:Name; not null; uniqueIndex:namespace_name; type:varchar(191);"` // Index improves performance of the List ang Get queries
Collaborator

this looks unaddressed

@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch 4 times, most recently from a5f7ae7 to e7e7f7a Compare July 10, 2025 01:30
@HumairAK HumairAK changed the title from "[wip]chore(backend): migrate GORM v1 to v2" to "chore(backend): migrate GORM v1 to v2" Aug 7, 2025
@HumairAK
Collaborator

HumairAK commented Aug 7, 2025

@kaikaila can you also rebase your PR so that I'm not a co-author? This is all your work, so you should be the only author listed!

@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch from cac2afd to 148090c Compare August 7, 2025 23:01
@kaikaila
Contributor Author

kaikaila commented Aug 7, 2025

Hi @HumairAK
Thanks for your comments. I’ve addressed all of them and reviewed the schema diff to confirm it matches the expected changes. For now, I’ve roughly split the changes into 3 commits for my own convenience in case we need to make further updates. If Matt still needs to review, I’ll keep them as 3 commits; if not, I'm happy to squash them into a single commit.

scope.Raw(
"ALTER TABLE " + quotedTableName + " ADD COLUMN DisplayName VARCHAR(255) NULL;",
).Exec()
scope.Raw("UPDATE " + quotedTableName + " SET DisplayName = Name").Exec()
Collaborator

We still need this update code so that users upgrading from say 2.4 to this version will have the DisplayName column filled in since it's a required field.

Contributor Author

Hi @mprahl, thanks for pointing that out. I didn't realize the function also backfills the DisplayName column. I’ve restored the addDisplayName function.

Q1: Is it worth adding a unit test for it using sqlmock?

Q2: In GORM v1, addDisplayName first checked whether the user had already added a DisplayName column.
If the user had already customized this column, we allowed it to be nullable — which differs from the GORM tag that requires DisplayName to be NOT NULL.

Could you confirm whether we should enforce NOT NULL in this case?
If the user already has a DisplayName column, should we fill any null values with the Name value and then set the column to NOT NULL?

Collaborator

So the context is that only Name existed. DisplayName got added as a required column but to do so, you first need to create the DisplayName column as nullable, copy the values from Name to existing rows, and then make DisplayName not nullable.

So we just need to keep that flow. No need to add additional test coverage unless it's easy to do.
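
The three-step flow described here (add as nullable, copy from Name, then enforce NOT NULL) could be expressed as an ordered list of statements. The first two mirror the snippet quoted earlier in this thread; the third uses MySQL's MODIFY COLUMN syntax and is an assumption, as is the helper name displayNameBackfillSQL. The PR's actual addDisplayName implementation is not shown in this conversation.

```go
package main

import "fmt"

// displayNameBackfillSQL returns the statements in the order they must run.
// quotedTable is expected to be already quoted for the target dialect.
func displayNameBackfillSQL(quotedTable string) []string {
	return []string{
		// 1. Add the column as nullable so existing rows stay valid.
		"ALTER TABLE " + quotedTable + " ADD COLUMN DisplayName VARCHAR(255) NULL",
		// 2. Copy Name into DisplayName for all existing rows.
		"UPDATE " + quotedTable + " SET DisplayName = Name",
		// 3. Tighten the column once every row has a value (MySQL syntax, assumed).
		"ALTER TABLE " + quotedTable + " MODIFY COLUMN DisplayName VARCHAR(255) NOT NULL",
	}
}

func main() {
	for _, stmt := range displayNameBackfillSQL("`pipelines`") {
		fmt.Println(stmt)
	}
}
```

The order matters: enforcing NOT NULL before the UPDATE would fail on any table that already contains rows.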

Contributor Author

@kaikaila kaikaila Aug 8, 2025

Thanks, I’ve restored the flow to: ADD as NULL → UPDATE from Name → enforce NOT NULL.

One follow-up on edge cases the v1 logic didn’t cover:
today we only run addDisplayName if the column is missing. If an installation already has a user-created DisplayName column (possibly nullable, with some NULLs), the legacy code does nothing — no backfill and no NOT NULL enforcement.

Question: what’s our policy for that case?
• Do we want to enforce consistency with the current model (i.e., still run UPDATE … WHERE DisplayName IS NULL and then set the column to NOT NULL, even if the column already exists)?
• Or do we leave user-created columns as-is and only enforce non-null on new writes at the API layer?

Right now I can implement the first option safely (backfill NULLs then set NOT NULL). Please advise which direction we prefer.

Separately, note that DropAllConstraintsAndIndexes drops all FKs/UNIQUE/non-primary indexes. That will also remove user-created indexes and AutoMigrate will only recreate the GORM-tagged ones. Is that acceptable, or should we limit drops to KFP-managed objects only?

Collaborator

@kaikaila we can assume the user didn't manually add a column and if they did, the migration should fail. 😄

return fmt.Errorf("failed to backfill experiment UUID in run_details table: %s", err)
}

if err := db.Migrator().AlterColumn(&model.Pipeline{}, "Description"); err != nil {
Collaborator

Could you add a comment explaining why this is not handled in AutoMigrate?

Contributor Author

You’re right, it is unnecessary. I removed the AlterColumn.

}

// Step 3: drop all indexes and constraints except primary key which blocks shrinking columns
if err := DropAllConstraintsAndIndexes(db, dialect.Name); err != nil {
Collaborator

Have you considered dropping only the constraints that block the migration, instead of dropping all of them?

If this is feasible, that would be ideal. The concern is that if a user has a large number of runs, it may take a while to rebuild the indexes.

Contributor Author

@kaikaila kaikaila Aug 10, 2025

Thanks for the suggestion. I implemented a dropLegacyIndexes function, which drops only the specific indexes that block the migration.

Also, I have a follow-up question: since KFP has never officially supported pgx before, is it reasonable to handle pgx only in the fresh install path, and not cover the legacy upgrade path for it?

Collaborator

Yeah, I think that's fine. Maybe we can run the legacy check anyway, but if the driverName is pgx, we return a meaningful error instead of proceeding with the migration.
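
A guard along the lines suggested here might look like the following sketch. The function name checkLegacyUpgradeSupported, the sentinel error, and the error wording are hypothetical; only the driverName value "pgx" and the idea of failing early come from the discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// errLegacyUpgradeUnsupported is a hypothetical sentinel error.
var errLegacyUpgradeUnsupported = errors.New("legacy schema upgrade is not supported")

// checkLegacyUpgradeSupported returns an error when a legacy schema is
// detected on a driver that has no legacy upgrade path (here: pgx).
// Fresh installs and non-legacy schemas pass through unchanged.
func checkLegacyUpgradeSupported(driverName string, legacySchema bool) error {
	if legacySchema && driverName == "pgx" {
		return fmt.Errorf("driver %q: %w; only fresh installs are handled for this driver",
			driverName, errLegacyUpgradeUnsupported)
	}
	return nil
}

func main() {
	fmt.Println(checkLegacyUpgradeSupported("pgx", true))
	fmt.Println(checkLegacyUpgradeSupported("mysql", true))
}
```

Wrapping a sentinel with %w lets callers test for this case with errors.Is while still surfacing a readable message.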

@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch 3 times, most recently from e5418bb to 350781c Compare August 11, 2025 08:44
@kaikaila
Contributor Author

kaikaila commented Aug 11, 2025

Here’s the SQL I used to simulate a very old schema for testing the legacy upgrade workflow (might be handy for your verification).

-- switch to master branch (gorm v1)
-- launch api server in master branch

USE mlpipeline;

-- setup to test backfilling pipeline_versions
DROP TABLE pipeline_versions;

-- setup to test dropLegacyIndexes()
CREATE UNIQUE INDEX Name ON experiments (Name);
CREATE UNIQUE INDEX Name ON pipelines (Name);

-- setup to test addDisplayNameColumn()
ALTER TABLE pipelines DROP COLUMN DisplayName;

-- insert dummy data to set up the test for initPipelineVersionsFromPipelines()
INSERT INTO pipelines (UUID, CreatedAtInSec, Name, Description, Parameters, Status, DefaultVersionId, Namespace)
VALUES
('pipe-uuid-1', UNIX_TIMESTAMP(), 'pipeline1', 'Dummy pipeline 1', NULL, 'READY', NULL, 'default'),
('pipe-uuid-2', UNIX_TIMESTAMP(), 'pipeline2', 'Dummy pipeline 2', NULL, 'READY', NULL, 'default');

-- switch to chore/gorm-v2-migration branch
-- launch api server again
-- there should be 2 rows in pipeline_versions

@kaikaila
Contributor Author

Since KFP hasn’t officially supported pgx before, would it be acceptable for InitDBClient to handle pgx only for fresh installs and skip it in the legacy upgrade path?

@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch from 350781c to 6d15489 Compare August 13, 2025 20:11
@kaikaila
Contributor Author

/retest

@mprahl
Collaborator

mprahl commented Aug 14, 2025

@kaikaila could you please squash your commits? Then I'll lgtm it!

Key changes:
- Enforce stricter string length limits (API + DB schema) to ensure MySQL indexability.
- Add preflight scan to block upgrade if legacy data violates limits.
- Cleanup/normalize legacy MySQL indexes, drop/rename duplicates.
- Split InitDBClient into legacy upgrade vs non-legacy autoMigrate paths.
- Centralize length specs for both API validation and DDL shrink.
- Replace GORM v1 APIs with v2 Migrator and struct tags.

Signed-off-by: kaikaila <[email protected]>
@kaikaila kaikaila force-pushed the chore/gorm-v2-migration branch from 6d15489 to 8e41c88 Compare August 14, 2025 19:13
@kaikaila
Contributor Author

Squashed! 🎉 Thanks @mprahl — glad we’re almost there.

@kaikaila
Contributor Author

/retest

@mprahl
Collaborator

mprahl commented Aug 15, 2025

/lgtm great work!

@HumairAK
Collaborator

Tested and verified.

/lgtm
/approve

Amazing work @kaikaila 🥳 🥇 🎉 !!!

@google-oss-prow google-oss-prow bot added the lgtm label Aug 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK


@google-oss-prow google-oss-prow bot merged commit 2af42c3 into kubeflow:master Aug 15, 2025
82 of 86 checks passed
@kaikaila kaikaila deleted the chore/gorm-v2-migration branch August 15, 2025 21:29
