Turbo87
Member

@Turbo87 Turbo87 commented Jun 27, 2025

This PR introduces basic source code analysis for newly published versions. A new crates_io_linecount workspace crate uses the tokei crate to analyze source files during the publish process. The system collects language breakdowns and line count statistics, storing them as JSON in a new linecounts column on the versions table.

The analysis runs during tarball processing and excludes test directories and non-programming files. All existing functionality remains unchanged, with the new column being optional for backward compatibility.

Update: The analysis has been moved to a dedicated background job that is triggered by the publish flow or manually via the crates-admin tool.

Note that this is only the first step in a series of pull requests. The follow-up PRs will:

  • implement a background job to backfill the existing versions
  • adjust the API responses to expose the data
  • show the total SLoC count in the crate sidebar of the website
  • adjust the OpenGraph images to show the SLoC count

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Jun 27, 2025
@Turbo87 Turbo87 moved this to For next meeting in crates.io team meetings Jun 27, 2025
Contributor

@LawnGnome LawnGnome left a comment

No specific concerns on the implementation here.

I do wonder a little if we want to do this synchronously during publish. I don't expect line count calculation to be terribly expensive in the common case, but to defend against potential pathological cases, I might feel better about this if it was a background job. Not sure if anyone else on @rust-lang/crates-io has strong feelings here.

@Turbo87
Member Author

Turbo87 commented Aug 6, 2025

I ran a quick benchmark of the process_tarball() fn using the mozjs_sys-0.67.1.crate file:

Before

tarball_processing/process_mozjs_tarball
                        time:   [365.17 ms 368.25 ms 373.35 ms]

After

tarball_processing/process_mozjs_tarball
                        time:   [990.60 ms 1.0022 s 1.0289 s]
                        change: [+173.34% +177.54% +181.29%] (p = 0.00 < 0.05)
                        Performance has regressed.

Admittedly, that is a more severe performance impact than I had anticipated. It might still be small compared to the time it takes to upload the crate file to our server, but still quite significant. I guess going with a background job makes more sense in this case, even though it requires us to download the crate file from the CDN again right after uploading it.

This introduces a new workspace crate that provides line counting functionality using `tokei`. The crate includes `LinecountStats` and `LanguageStats` data structures for storing results, along with core analysis functions for processing file contents.

The implementation includes language filtering to exclude non-programming files and path filtering to skip test and example directories. Comprehensive test coverage is provided with `insta` snapshots to ensure reliable functionality.
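As a rough illustration of the filtering and aggregation described above, here is a minimal, dependency-free sketch. It does not call `tokei`; a naive non-blank line count stands in for the real analysis. Only the type names `LinecountStats` and `LanguageStats` come from this commit message. The field names, the helper functions, the directory names, and the extension table are all assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::path::{Component, Path};

/// Per-language totals. Field names are assumptions, not the crate's real layout.
#[derive(Debug, Default, Clone, PartialEq)]
struct LanguageStats {
    code_lines: usize,
    files: usize,
}

/// Aggregated result. Field names are assumptions as well.
#[derive(Debug, Default, PartialEq)]
struct LinecountStats {
    languages: BTreeMap<String, LanguageStats>,
    total_code_lines: usize,
}

/// Hypothetical path filter: ignore files that live under a test or example
/// directory. Only parent directories are checked, so `src/tests.rs` is kept.
fn should_ignore_path(path: &Path) -> bool {
    path.parent()
        .into_iter()
        .flat_map(|dir| dir.components())
        .any(|component| match component {
            Component::Normal(name) => {
                matches!(name.to_str(), Some("tests" | "test" | "examples" | "example"))
            }
            _ => false,
        })
}

/// Hypothetical language filter: map a few extensions to a language name and
/// treat everything else (Markdown, TOML, JSON, ...) as a non-programming file.
fn language_for(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("Rust"),
        "c" | "h" => Some("C"),
        "py" => Some("Python"),
        _ => None,
    }
}

/// Naive stand-in for the real analysis: counts non-blank lines only.
fn count_code_lines(contents: &str) -> usize {
    contents.lines().filter(|line| !line.trim().is_empty()).count()
}

impl LinecountStats {
    /// Adds one file to the totals unless a filter rejects it.
    fn add_file(&mut self, path: &Path, contents: &str) {
        if should_ignore_path(path) {
            return;
        }
        let Some(language) = language_for(path) else {
            return;
        };
        let lines = count_code_lines(contents);
        let entry = self.languages.entry(language.to_string()).or_default();
        entry.code_lines += lines;
        entry.files += 1;
        self.total_code_lines += lines;
    }
}

fn main() {
    let mut stats = LinecountStats::default();
    stats.add_file(Path::new("src/lib.rs"), "fn a() {}\n\nfn b() {}\n");
    stats.add_file(Path::new("tests/smoke.rs"), "#[test]\nfn ok() {}\n");
    stats.add_file(Path::new("README.md"), "# Title\n");
    println!("{stats:#?}");
}
```

In this sketch only `src/lib.rs` contributes to the totals: the file under `tests/` is rejected by the path filter and `README.md` by the language filter.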

This crate provides the foundation for adding SLOC metrics to crates.io by offering a clean, testable interface for analyzing source code statistics.

This commit adds a new `JSONB` column called `linecounts` to the versions table to store Source Lines of Code statistics for each crate version. The column stores language breakdown and totals as structured `JSON` data, enabling flexible schema evolution without requiring additional migrations.

The database schema and test snapshots are updated accordingly to reflect this new column structure.

This adds the `linecounts` field to both the `Version` struct and `NewVersion` builder. The field stores linecount data as JSON, following the established pattern for flexible schema evolution without requiring additional migrations.

The `linecounts` field is an `Option` to handle existing versions that don't have this data, and will be populated for new versions during the publish process. This design ensures backward compatibility while enabling rich source code metrics for future crate versions.
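To make the backward-compatibility point concrete, here is a small sketch of how a consumer might handle the optional field. The `Version` struct below is a hypothetical, trimmed-down stand-in rather than the real model. The raw JSON string is a placeholder for the stored value, and its shape is an assumption, as is the `sloc_label` helper.

```rust
/// Hypothetical, trimmed-down stand-in for the `Version` model.
struct Version {
    num: String,
    /// Raw JSON text standing in for the stored value. `None` covers versions
    /// that were published before the column existed or have no data yet.
    linecounts: Option<String>,
}

/// Hypothetical helper: describes a version whether or not data is present.
fn sloc_label(version: &Version) -> String {
    match &version.linecounts {
        Some(json) => format!("{}: linecounts = {json}", version.num),
        None => format!("{}: linecounts not available", version.num),
    }
}

fn main() {
    let old = Version {
        num: "0.1.0".into(),
        linecounts: None,
    };
    let new = Version {
        num: "0.2.0".into(),
        // Placeholder payload; the real JSON layout is not shown in this thread.
        linecounts: Some(r#"{"total_code_lines":3}"#.into()),
    };
    println!("{}", sloc_label(&old));
    println!("{}", sloc_label(&new));
}
```

Callers that treat `None` as "no data yet" keep working for versions without line counts.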
@Turbo87
Member Author

Turbo87 commented Sep 15, 2025

@LawnGnome thanks for nudging this in the right direction. I've changed the implementation to run in a background job instead, which should resolve the performance concerns.

@rust-lang/crates-io unless there are any objections I plan on merging this towards the end of the week.

@Turbo87 Turbo87 requested review from LawnGnome and a team September 15, 2025 15:23
@Turbo87 Turbo87 merged commit 8720d77 into rust-lang:main Sep 18, 2025
10 checks passed
@Turbo87 Turbo87 deleted the sloc branch September 18, 2025 08:07
@epage

epage commented Sep 18, 2025

I'm curious: how are you planning to use this, and how do the places that inspired it use it? I'm not seeing any linked issue or comment that provides the bigger picture.

@Turbo87
Member Author

Turbo87 commented Sep 18, 2025

the main usages are for spam detection and nice-to-have metadata display

Labels
A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works
Projects
Archived in project
3 participants