fix(error) commit table split into two(file and pull request) fixes issue#3682 #3727

Closed
yashisthebatman wants to merge 1 commit into chaoss:main from yashisthebatman:main

Conversation

@yashisthebatman

Description

  • Refactored the commits table to properly separate commit-level metadata from file-level statistics. The existing commits table stored one row per file per commit, mixing commit data (author, committer, hash, timestamps) with file-level data (filename, lines added/removed/whitespace). This caused data redundancy and inflated counts in metric queries.

    Changes:

    • Refactored the Commit model in augur_data.py to contain only commit-level columns, with a UniqueConstraint on (repo_id, cmt_commit_hash).
    • Created a new CommitFile model (commit_files table) holding file-level columns (cmt_filename, cmt_added, cmt_removed, cmt_whitespace) with a foreign key to commits.cmt_id.
    • Rewrote facade_bulk_insert_commits() in lib.py to split incoming records into commit-level upserts and file-level inserts.
    • Updated all 8 cache-building SQL queries in rebuildcache.py to LEFT JOIN commit_files for file-level aggregation.
    • Added CommitFileType GraphQL type and files field on CommitType in server.py.
    • Created Alembic migration (revision 39) that handles table rename, backfill via SELECT DISTINCT ON, FK remapping for commit_parents and commit_comment_ref, and includes a full downgrade path.
    • Added 13 unit tests validating model structure, relationships, constraints, and helper functions.
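
The record splitting described in the list above can be illustrated with a minimal sketch. This is not the code from the PR: `split_commit_records` is a hypothetical helper, `cmt_author_name` is an assumed commit-level column, and only the file-level column names and the `(repo_id, cmt_commit_hash)` key are taken from the description.

```python
from typing import Any

# File-level columns and the commit key as named in the PR description.
FILE_COLUMNS = ("cmt_filename", "cmt_added", "cmt_removed", "cmt_whitespace")
COMMIT_KEY = ("repo_id", "cmt_commit_hash")


def split_commit_records(
    records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split one-row-per-file records into commit rows and file rows.

    Hypothetical helper for illustration only.
    """
    commits: dict[tuple, dict[str, Any]] = {}
    files: list[dict[str, Any]] = []
    for record in records:
        key = tuple(record[col] for col in COMMIT_KEY)
        # Commit-level data: keep the first occurrence per (repo_id, hash).
        if key not in commits:
            commits[key] = {k: v for k, v in record.items() if k not in FILE_COLUMNS}
        # File-level data: one row per file, carrying the commit key so a
        # later step could resolve it to commits.cmt_id.
        file_row = {col: record[col] for col in FILE_COLUMNS}
        file_row.update(zip(COMMIT_KEY, key))
        files.append(file_row)
    return list(commits.values()), files


# Three per-file records covering two distinct commits.
records = [
    {"repo_id": 1, "cmt_commit_hash": "abc", "cmt_author_name": "A",
     "cmt_filename": "a.py", "cmt_added": 10, "cmt_removed": 2, "cmt_whitespace": 1},
    {"repo_id": 1, "cmt_commit_hash": "abc", "cmt_author_name": "A",
     "cmt_filename": "b.py", "cmt_added": 3, "cmt_removed": 0, "cmt_whitespace": 0},
    {"repo_id": 1, "cmt_commit_hash": "def", "cmt_author_name": "B",
     "cmt_filename": "a.py", "cmt_added": 1, "cmt_removed": 1, "cmt_whitespace": 0},
]
commit_rows, file_rows = split_commit_records(records)
print(len(commit_rows), len(file_rows))  # 2 3
```

Keeping the first occurrence per key is a simplification; an upsert in the database would decide for itself which values win on conflict.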

This PR fixes #3682

Notes for Reviewers

  • The analyzecommit.py and facade_tasks.py files were intentionally left unchanged — facade_bulk_insert_commits() now handles the record splitting internally, keeping the data pipeline interface stable.
  • The committers metric in commit.py was not modified since it only queries commit-level columns, but it will now return more accurate results (one row per commit instead of one per file).
  • The Alembic migration (rev 39) needs to be tested against a staging PostgreSQL database before merging. The backfill uses DISTINCT ON (repo_id, cmt_commit_hash) to deduplicate.
  • The commented-out cache queries in rebuildcache.py were also updated for consistency in case they are re-enabled.
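
To make the "one row per commit instead of one per file" point concrete, here is a small sketch using an in-memory SQLite database as a stand-in for PostgreSQL. The table layouts are reduced to a few columns and `old_commits` is a hypothetical name for the pre-split table. SQLite has no `DISTINCT ON`, so a plain `SELECT DISTINCT` over the key columns is used for the backfill instead; the actual migration is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Pre-split layout: one row per file per commit (table name is hypothetical).
CREATE TABLE old_commits (repo_id INT, cmt_commit_hash TEXT, cmt_filename TEXT, cmt_added INT);
INSERT INTO old_commits VALUES
  (1, 'abc', 'a.py', 10), (1, 'abc', 'b.py', 3), (1, 'def', 'a.py', 1);

-- Post-split layout, reduced to the columns needed for the example.
CREATE TABLE commits (
  cmt_id INTEGER PRIMARY KEY,
  repo_id INT,
  cmt_commit_hash TEXT,
  UNIQUE (repo_id, cmt_commit_hash)
);
CREATE TABLE commit_files (
  cmt_id INT REFERENCES commits (cmt_id),
  cmt_filename TEXT,
  cmt_added INT
);

-- Backfill: one commits row per (repo_id, hash), then attach the file rows.
INSERT INTO commits (repo_id, cmt_commit_hash)
  SELECT DISTINCT repo_id, cmt_commit_hash FROM old_commits;
INSERT INTO commit_files (cmt_id, cmt_filename, cmt_added)
  SELECT c.cmt_id, o.cmt_filename, o.cmt_added
  FROM old_commits o
  JOIN commits c ON c.repo_id = o.repo_id AND c.cmt_commit_hash = o.cmt_commit_hash;
""")

# Counting rows in the old layout counts files, not commits.
inflated = conn.execute("SELECT COUNT(*) FROM old_commits WHERE repo_id = 1").fetchone()[0]
accurate = conn.execute("SELECT COUNT(*) FROM commits WHERE repo_id = 1").fetchone()[0]

# File-level aggregation now goes through a LEFT JOIN on commit_files.
per_commit = conn.execute("""
    SELECT c.cmt_commit_hash, COUNT(f.cmt_filename), COALESCE(SUM(f.cmt_added), 0)
    FROM commits c
    LEFT JOIN commit_files f ON f.cmt_id = c.cmt_id
    GROUP BY c.cmt_id, c.cmt_commit_hash
    ORDER BY c.cmt_commit_hash
""").fetchall()
print(inflated, accurate, per_commit)  # 3 2 [('abc', 2, 13), ('def', 1, 1)]
```

The `LEFT JOIN` keeps commits that have no file rows, which is why the sum is wrapped in `COALESCE`.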

Signed commits

  • Yes, I signed my commits.

Copilot AI review requested due to automatic review settings February 19, 2026 18:07
Contributor

Copilot AI left a comment

Pull request overview

This pull request refactors the commits table to properly separate commit-level metadata from file-level statistics. Previously, the commits table stored one row per file per commit, mixing commit data with file-level data. This refactoring addresses Issue #3682 by splitting the table into two: a new commits table for commit-level data (one row per commit) and a commit_files table for file-level data (one row per file per commit).

Changes:

  • Separated commits table into commits (commit-level) and commit_files (file-level) tables with proper foreign key relationships
  • Updated database models with new CommitFile model and refactored Commit model
  • Modified facade_bulk_insert_commits() to split records and upsert appropriately
  • Updated 8 SQL cache-building queries in rebuildcache.py to LEFT JOIN commit_files
  • Added GraphQL CommitFileType and files relationship
  • Created Alembic migration (revision 39) with upgrade and downgrade paths
  • Added 13 unit tests for model validation and helper functions

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| augur/application/schema/alembic/versions/39_split_commits_into_commits_and_commit_files.py | Alembic migration that renames commits to commit_files, creates new commits table, backfills data, and remaps foreign keys |
| augur/application/db/models/augur_data.py | Refactored Commit model to remove file-level columns, added new CommitFile model with foreign key to commits |
| augur/application/db/models/__init__.py | Exported CommitFile model for use throughout the application |
| augur/application/db/lib.py | Rewrote facade_bulk_insert_commits() to split records into commit and file-level data with separate upserts, added helper functions |
| augur/tasks/git/util/facade_worker/facade_worker/rebuildcache.py | Updated 8 cache-building SQL queries to LEFT JOIN commit_files for file-level aggregation |
| augur/api/server.py | Added CommitFileType GraphQL type and files field on CommitType |
| tests/test_classes/test_commit_file_model.py | Added comprehensive unit tests for model structure, relationships, constraints, and helper functions |

@yashisthebatman force-pushed the main branch 11 times, most recently from 2f29b5d to 376fc1c February 19, 2026 22:19
Signed-off-by: yashisthebatman <yvchaudhary2005@gmail.com>
@yashisthebatman
Author

@sgoggins By default, Docker/Podman gives containers 10 seconds to terminate gracefully on a docker-compose down. If they don't, it forcefully kills them with SIGKILL, which registers as an error in the Podman CI tests.

I looked at augur/application/service_manager.py, which controls shutting down the main container (augur-1). The shutdown signal handler ran its steps serially:

  • It stopped Gunicorn and waited up to 5s.
  • Then it told each Celery worker to stop and waited 3s for each one.
  • Then it ran celery purge in a new Python CLI subprocess, which is slow (~3s).
  • Then it ran a curl shutdown command hardcoded to hit http://localhost:15672 (ignoring that RabbitMQ runs in a separate container), which hung until curl timed out.

Because each step blocked the next, a normal docker-compose down easily took 15-20 seconds, which exceeded the timeout and caused the failure. I have also created a fix for that issue and pushed it in service_manager.py.
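
One general way to keep a serial sequence like this under a fixed budget is to run the steps concurrently against a single shared deadline. The sketch below is only an illustration of that idea, not the fix pushed in this PR (which is not shown in this thread); `shutdown_concurrently`, `fake_step`, and the step names are hypothetical, and the sleeps stand in for real stop commands.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def shutdown_concurrently(steps, deadline_seconds):
    """Start every shutdown step at once and wait up to one shared deadline.

    Returns the names of steps that were still running when the deadline hit.
    """
    pool = ThreadPoolExecutor(max_workers=max(len(steps), 1))
    futures = {pool.submit(fn): name for name, fn in steps.items()}
    done, not_done = wait(futures, timeout=deadline_seconds)
    # Do not block on stragglers; the caller can escalate (e.g. kill) instead.
    pool.shutdown(wait=False, cancel_futures=True)
    return sorted(futures[f] for f in not_done)


def fake_step(seconds):
    """Stand-in for a real stop command that takes `seconds` to finish."""
    return lambda: time.sleep(seconds)


steps = {
    "gunicorn": fake_step(0.2),
    "celery_workers": fake_step(0.2),
    "celery_purge": fake_step(0.2),
}
start = time.monotonic()
late = shutdown_concurrently(steps, deadline_seconds=1.0)
elapsed = time.monotonic() - start
# Run serially, these steps would take about 0.6s; concurrently, about 0.2s.
print(late, round(elapsed, 1))
```

With this shape the total wait is bounded by the slowest step or the deadline, whichever comes first, rather than by the sum of all steps.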

@MoralCode
Contributor

@yashisthebatman ideally finding a new problem like this would be reported as an issue so a solution can be discussed and planned and a PR can be linked to it. can you copy your comment there into a new issue?

Can you also update the title of this PR to better describe what change it makes?

@yashisthebatman
Author

@sgoggins Alright, I'll post the issue. However, passing the Podman test for this PR would require a solution to that problem. Docker is getting checked as well, and I have already included the solution in service_manager.py, so could you suggest what I should do?

@MoralCode
Contributor

if the timeouts are too low, we can solve that in the other issue, merge the fix and then this PR can be rebased onto that new main branch that incorporates the fix

By the way, I am not Sean and you do not need to repeatedly ping maintainers to get your message seen, we already have our notifications on for this repo.

@yashisthebatman changed the title from "Fixed Issue#3682" to "fix(error) commit table split into two(file and pull request) fixes issue#3682" Feb 19, 2026
@yashisthebatman
Author

alright sorry for that. I have created the issue

@MoralCode
Contributor

Speaking of this PR:

Splitting the commits table is a large, long-term task that touches very core items in Augur and needs to be done carefully and thoughtfully (with an eye towards how existing data in people's existing databases will get migrated). Ideally it would be done by someone who is a long-time maintainer and has built up a lot of trust with the rest of the core team.

This isn't the kind of task that is a great fit for a good first issue.

@MoralCode closed this Feb 21, 2026

Development

Successfully merging this pull request may close these issues.

commits table is actually representing commit files

3 participants