Add SLoC (Source Lines of Code) metric to versions #11453
No specific concerns on the implementation here.
I do wonder a little if we want to do this synchronously during publish. I don't expect line count calculation to be terribly expensive in the common case, but to defend against potential pathological cases, I might feel better about this if it was a background job. Not sure if anyone else on @rust-lang/crates-io has strong feelings here.
I ran a quick benchmark of the Before
After
Admittedly, that is a more severe performance impact than I had anticipated. It might still be small compared to the time it takes to upload the crate file to our server, but it is still quite significant. I guess going with a background job makes more sense in this case, even though it requires us to download the crate file from the CDN again right after uploading it.
This introduces a new workspace crate that provides line counting functionality using `tokei`. The crate includes `LinecountStats` and `LanguageStats` data structures for storing results, along with core analysis functions for processing file contents. The implementation includes language filtering to exclude non-programming files and path filtering to skip test and example directories. Comprehensive test coverage is provided with `insta` snapshots to ensure reliable functionality. This crate provides the foundation for adding SLOC metrics to crates.io by offering a clean, testable interface for analyzing source code statistics.
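As a rough illustration of the shape described above, here is a minimal, std-only sketch. Only the names `LinecountStats` and `LanguageStats` come from the commit message; the field names, the `add_file` helper, the `should_skip_path` function, and the skipped-directory list are assumptions made for this example. It does not call `tokei`; the per-file counts are supplied by hand.

```rust
use std::collections::BTreeMap;
use std::path::{Component, Path};

/// Per-language counts. Field names are assumptions for illustration.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct LanguageStats {
    pub code_lines: usize,
    pub comment_lines: usize,
    pub files: usize,
}

/// Aggregated counts for one crate version. Field names are assumptions.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct LinecountStats {
    pub languages: BTreeMap<String, LanguageStats>,
    pub total_code_lines: usize,
    pub total_comment_lines: usize,
}

impl LinecountStats {
    /// Hypothetical helper: fold the counts of one analyzed file into the totals.
    pub fn add_file(&mut self, language: &str, code_lines: usize, comment_lines: usize) {
        let entry = self.languages.entry(language.to_string()).or_default();
        entry.code_lines += code_lines;
        entry.comment_lines += comment_lines;
        entry.files += 1;
        self.total_code_lines += code_lines;
        self.total_comment_lines += comment_lines;
    }
}

/// Hypothetical path filter: skip files that live under a test or example
/// directory. The directory names are an assumption, not the crate's real list.
pub fn should_skip_path(path: &Path) -> bool {
    const SKIPPED_DIRS: [&str; 2] = ["tests", "examples"];
    // Only look at directory components, not at the file name itself.
    path.parent().map_or(false, |dir| {
        dir.components().any(|component| match component {
            Component::Normal(name) => name
                .to_str()
                .map_or(false, |n| SKIPPED_DIRS.contains(&n)),
            _ => false,
        })
    })
}

fn main() {
    let mut stats = LinecountStats::default();
    // Hand-written (path, language, code lines, comment lines) tuples
    // standing in for the output of a real line counter.
    let files = [
        ("src/lib.rs", "Rust", 120, 30),
        ("src/util.rs", "Rust", 40, 5),
        ("tests/integration.rs", "Rust", 200, 10),
    ];
    for (path, language, code, comments) in files {
        if should_skip_path(Path::new(path)) {
            continue;
        }
        stats.add_file(language, code, comments);
    }
    println!("{stats:#?}");
}
```

Keeping the path filter separate from the aggregation makes each piece easy to test on its own, which fits the snapshot-testing approach the commit mentions.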
This commit adds a new `JSONB` column called `linecounts` to the versions table to store Source Lines of Code statistics for each crate version. The column stores language breakdown and totals as structured `JSON` data, enabling flexible schema evolution without requiring additional migrations. The database schema and test snapshots are updated accordingly to reflect this new column structure.
This adds the `linecounts` field to both the `Version` struct and `NewVersion` builder. The field stores linecount data as JSON, following the established pattern for flexible schema evolution without requiring additional migrations. The `linecounts` field is optional to handle existing versions that don't have this data, and will be populated for new versions during the publish process. This design ensures backward compatibility while enabling rich source code metrics for future crate versions.
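A small sketch of what the optional field means for readers of the data: versions published before this change carry no line counts, so callers have to handle both cases. The `Version` struct below is a simplified stand-in, not the real model. The `num` field, the `describe_linecounts` function, and the use of a plain `String` in place of a JSON value type are all assumptions for illustration.

```rust
/// Simplified stand-in for the real `Version` model.
#[derive(Debug, Clone, PartialEq)]
struct Version {
    num: String,
    /// Raw JSON text; `None` for versions that have no line count data yet.
    /// (A `String` is used here only to keep the sketch dependency-free.)
    linecounts: Option<String>,
}

/// Hypothetical helper showing that both cases must be handled.
fn describe_linecounts(version: &Version) -> String {
    match &version.linecounts {
        Some(json) => format!("{}: {}", version.num, json),
        None => format!("{}: no line count data", version.num),
    }
}

fn main() {
    let old = Version {
        num: "0.1.0".to_string(),
        linecounts: None,
    };
    let new = Version {
        num: "0.2.0".to_string(),
        linecounts: Some(r#"{"total_code_lines":160}"#.to_string()),
    };
    println!("{}", describe_linecounts(&old));
    println!("{}", describe_linecounts(&new));
}
```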
@LawnGnome thanks for nudging this in the right direction. I've changed the implementation to run in a background job instead, which should resolve the performance concerns. @rust-lang/crates-io, unless there are any objections, I plan on merging this towards the end of the week.
I'm curious: how are you planning to use this, or how do the places that inspired this use it? I'm not seeing any linked issue or comment that provides the bigger picture.
The main uses are spam detection and nice-to-have metadata display.
This PR introduces basic source code analysis for newly published versions. A new `crates_io_linecount` workspace crate uses the `tokei` crate to analyze source files during the publish process. The system collects language breakdowns and line count statistics, storing them as JSON in a new `linecounts` column on the `versions` table.

The analysis runs during tarball processing and excludes test directories and non-programming files. All existing functionality remains unchanged, with the new column being optional for backward compatibility.

Update: The analysis has been moved to a dedicated background job that is triggered by the publish flow or manually via the `crates-admin` tool.

Note that this is only the first step in a series of pull requests. The follow-up PRs will:

- implement a background job to backfill the existing versions