Conversation
Updated download functions
minor fixes
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Consolidates MS2 model download/load into MS2ModelManager (URL-based download with SSL and zip validation), adds psutil and an MS2PIP feature mapping, and introduces threading, logging, spectrum caching/streaming, memory-saving PSM handling, and Docker runtime ENV/WORKDIR adjustments to /app.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CLI as Caller / CLI
    participant MD as model_downloader.py
    participant MM as MS2ModelManager
    participant HTTP as Remote (model_url)
    participant FS as Filesystem
    rect rgb(230,245,255)
        CLI->>MD: request download/load(model_dir)
        MD->>MM: instantiate MS2ModelManager(model_dir)
    end
    rect rgb(245,255,230)
        MM->>FS: check existing model files/zip
        alt local models present
            FS-->>MM: return installed models
        else
            MM->>HTTP: GET model_url (SSL via certifi)
            HTTP-->>MM: stream zip bytes
            MM->>FS: write zip to download path
            MM->>MM: validate zip (is_model_zip)
        end
    end
    rect rgb(255,250,230)
        MM->>MM: load_installed_models(from zip or local)
        MM-->>MD: models loaded / ready
        MD-->>CLI: success
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
PR Code Suggestions ✨
Explore these optional code suggestions:
Actionable comments posted: 3
🧹 Nitpick comments (2)
quantmsrescore/transfer_learning.py (1)
206-209: Move import to module level. The `configure_logging` import is placed inside the `__init__` method, but it should be at the top of the file alongside other imports (line 5 already imports from the same module).

🔎 Proposed refactor

Move the import to the top of the file (around line 5):

```diff
 from quantmsrescore.idxmlreader import IdXMLRescoringReader
-from quantmsrescore.logging_config import get_logger
+from quantmsrescore.logging_config import get_logger, configure_logging
 from quantmsrescore.openms import OpenMSHelper
```

Then remove the import from the `__init__` method:

```diff
 self._save_model_dir = save_model_dir
-# Set up logging
-from quantmsrescore.logging_config import configure_logging
-configure_logging(log_level)
```

quantmsrescore/ms2_model_manager.py (1)
284-291: Consider refactoring default argument computation. The default value for `charged_frag_types` calls `get_charged_frag_types(frag_types, max_frag_charge)` at function definition time, which means the computation happens once when the module is loaded, not per instance. While this may be intentional for performance, it can be confusing and is flagged by static analysis (Ruff B008).

🔎 Proposed refactor

```diff
 def __init__(
     self,
-    charged_frag_types=get_charged_frag_types(frag_types, max_frag_charge),
+    charged_frag_types=None,
     dropout=0.1,
     model_class: torch.nn.Module = ModelMS2Bert,
     device: str = "gpu",
     mask_modloss: Optional[bool] = None,
     override_from_weights: bool = False,
     **kwargs,  # model params
 ):
+    if charged_frag_types is None:
+        charged_frag_types = get_charged_frag_types(frag_types, max_frag_charge)
     super().__init__(
```
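The pitfall behind B008 can be shown in isolation. The sketch below uses hypothetical names (`build_types` stands in for `get_charged_frag_types`): a call in a default argument runs once at `def` time, so the resulting list is shared across every call that relies on the default.

```python
# A minimal sketch (hypothetical names) of why Ruff B008 flags call-in-default:
# the default is computed once, at function definition time, and shared.

def build_types():
    """Stand-in for get_charged_frag_types(); returns a fresh list each call."""
    return ["b_z1", "y_z1"]

# Anti-pattern: build_types() runs once, when `def` executes.
def init_bad(charged_frag_types=build_types()):
    charged_frag_types.append("modified")
    return charged_frag_types

# Preferred: sentinel default, computed per call.
def init_good(charged_frag_types=None):
    if charged_frag_types is None:
        charged_frag_types = build_types()
    charged_frag_types.append("modified")
    return charged_frag_types

print(init_bad())   # shared default keeps growing across calls
print(init_bad())
print(init_good())  # always a fresh list
print(init_good())
```

The `None`-sentinel form in the proposed refactor avoids both the shared-state surprise and the import-time cost.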
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Dockerfile
- quantmsrescore/constants.py
- quantmsrescore/model_downloader.py
- quantmsrescore/ms2_model_manager.py
- quantmsrescore/transfer_learning.py
🧰 Additional context used
🧬 Code graph analysis (2)
quantmsrescore/transfer_learning.py (1)
quantmsrescore/logging_config.py (1)
configure_logging(54-165)
quantmsrescore/model_downloader.py (1)
quantmsrescore/ms2_model_manager.py (1)
MS2ModelManager(22-256)
🪛 Ruff (0.14.10)
quantmsrescore/ms2_model_manager.py
60-60: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
80-84: Avoid specifying long messages outside the exception class
(TRY003)
107-107: Unused method argument: model_type
(ARG002)
285-285: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
303-303: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
312-312: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
🔇 Additional comments (8)
Dockerfile (1)
24-24: LGTM! Build and runtime stages now aligned. The change aligns the build stage working directory with the runtime stage, improving consistency across the multi-stage build.
quantmsrescore/transfer_learning.py (3)
94-94: LGTM! Log level configuration added. The CLI option provides users with control over logging verbosity, with a sensible default.
110-112: LGTM! Parameter properly integrated. The log_level parameter is correctly added to the function signature with appropriate documentation.
Also applies to: 156-157
171-172: LGTM! Log level properly propagated. The log level is correctly normalized to uppercase and passed to the trainer initialization.
quantmsrescore/constants.py (1)
70-70: LGTM! Feature mapping extended. The new "MS2PIP:Cos" mapping follows the existing naming conventions and is properly positioned among related cosine features.
quantmsrescore/ms2_model_manager.py (3)
2-3: LGTM! Imports added for download functionality. The new imports support secure model downloading with SSL verification using certifi.
Also applies to: 17-19
26-26: Verify default model directory behavior. The constructor now uses `model_dir="."` as the default, which resolves to the current working directory. Ensure this works correctly when the code is invoked from different directories or contexts (e.g., Docker containers, different execution environments).

Consider whether a more stable default (e.g., a home directory path or config-based location) would be more robust across different execution contexts.
Also applies to: 39-48
58-60: LGTM! Proper security measures for downloads. The URL scheme validation and SSL context with certifi ensure secure model downloads. The nosec comment appropriately documents the acknowledged security consideration.
Also applies to: 69-70
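The scheme-validation plus certifi pattern praised above can be sketched as follows. The function name `open_model_url` is hypothetical (the real code lives in `MS2ModelManager`), and the certifi import is guarded so the sketch also runs where certifi is absent.

```python
# A minimal sketch of the scheme-validation + certifi pattern the review
# describes (function name hypothetical; the real code is in MS2ModelManager).
import ssl
import urllib.parse
import urllib.request

try:
    import certifi  # CA bundle the PR uses for SSL verification
    _CAFILE = certifi.where()
except ImportError:
    _CAFILE = None  # fall back to the system CA store


def open_model_url(url: str, timeout: float = 300):
    """Open a model URL over HTTP(S) only, with TLS verified against the CA bundle."""
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
    context = ssl.create_default_context(cafile=_CAFILE)
    return urllib.request.urlopen(url, context=context, timeout=timeout)  # nosec B310
```

Rejecting non-HTTP(S) schemes up front is what makes the `# nosec B310` suppression defensible: `file:` and custom schemes never reach `urlopen`.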
Actionable comments posted: 0
🧹 Nitpick comments (3)
quantmsrescore/ms2_model_manager.py (3)
62-66: Clarify or fix the `overwrite` parameter semantics. The current logic is backwards from typical `overwrite` semantics:

- When `overwrite=True` and file exists: skips download (line 65 comment)
- When `overwrite=False` and file exists: raises error

Typically, `overwrite=True` means "replace the existing file," not "skip if it exists." This naming could confuse future maintainers.

Consider either:

- Renaming the parameter to `allow_existing` or `skip_if_exists` to match the actual behavior, or
- Fixing the logic so `overwrite=True` actually re-downloads and replaces the file

🔎 Option 1: Rename parameter to match behavior

```diff
-    def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
+    def _download_models(self, model_zip_file_path: str, allow_existing: bool = True) -> None:
         """Download models if not done yet."""
         url = self.model_url
         parsed = urllib.parse.urlparse(url)
         if parsed.scheme not in ("http", "https"):
             raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
         if os.path.exists(model_zip_file_path):
-            if not overwrite:
+            if not allow_existing:
                 raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-            # File exists and overwrite is True, skip download
+            # File exists and allow_existing is True, skip download
         else:
```

🔎 Option 2: Fix logic to match typical overwrite semantics

```diff
     def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
         """Download models if not done yet."""
         url = self.model_url
         parsed = urllib.parse.urlparse(url)
         if parsed.scheme not in ("http", "https"):
             raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
-        if os.path.exists(model_zip_file_path):
-            if not overwrite:
-                raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-            # File exists and overwrite is True, skip download
-        else:
+        if os.path.exists(model_zip_file_path) and not overwrite:
+            raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
+
+        if not os.path.exists(model_zip_file_path) or overwrite:
             logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
```
282-282: Consider moving the function call out of the default argument. The default argument calls `get_charged_frag_types(frag_types, max_frag_charge)` at function definition time. This can cause issues if:
- The function returns a mutable object shared across all instances
- The function depends on module-level state that changes
Move the call inside the function body instead.
🔎 Proposed fix
```diff
 def __init__(
     self,
-    charged_frag_types=get_charged_frag_types(frag_types, max_frag_charge),
+    charged_frag_types=None,
     dropout=0.1,
     model_class: torch.nn.Module = ModelMS2Bert,
     device: str = "gpu",
     mask_modloss: Optional[bool] = None,
     override_from_weights: bool = False,
     **kwargs,  # model params
 ):
+    if charged_frag_types is None:
+        charged_frag_types = get_charged_frag_types(frag_types, max_frag_charge)
     super().__init__(
         charged_frag_types=charged_frag_types,
```
299-303: Add `stacklevel=2` to the deprecation warning. The `warnings.warn` call should include `stacklevel=2` so the warning points to the caller's code rather than this line. This helps users identify where they're using the deprecated parameter.

🔎 Proposed fix

```diff
 if mask_modloss is not None:
     warnings.warn(
         "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
         "the charged_frag_types should not include the modloss fragments.",
+        stacklevel=2,
     )
```
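What `stacklevel` changes can be demonstrated with a small self-contained snippet (names hypothetical): with `stacklevel=2`, the recorded warning is attributed to the caller's line, not to the `warnings.warn` line inside the deprecated API.

```python
# Demonstration of stacklevel: the warning's reported location becomes the
# call site of deprecated_api, not the warnings.warn line inside it.
import warnings


def deprecated_api(mask_modloss=None):
    if mask_modloss is not None:
        warnings.warn("mask_modloss is deprecated", DeprecationWarning, stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    deprecated_api(mask_modloss=True)  # <- with stacklevel=2, the warning points here

assert len(caught) == 1
print(caught[0].filename, caught[0].lineno)  # caller's file and line
```

Without `stacklevel=2` (the default is 1), users see a traceback location inside the library, which tells them nothing about where their own code triggered the deprecation.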
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
quantmsrescore/ms2_model_manager.py
🧰 Additional context used
🪛 Ruff (0.14.10)
quantmsrescore/ms2_model_manager.py
60-60: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
71-71: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
81-85: Avoid specifying long messages outside the exception class
(TRY003)
282-282: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
300-300: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
309-309: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
🔇 Additional comments (3)
quantmsrescore/ms2_model_manager.py (3)
2-3: LGTM! Imports support custom download implementation. The new imports (`MODEL_DOWNLOAD_INSTRUCTIONS`, `is_model_zip`, `urllib`, `ssl`, `certifi`) appropriately support the custom model download implementation with SSL/certificate handling.

Also applies to: 17-19
26-26: LGTM! Constructor changes streamline model initialization. The changes consolidate model discovery, download, and loading into a unified flow:

- Check for external models in `model_dir`
- If not found, download and load the pretrained models
- Store the model source for reference

The default `model_dir="."` and the new `model_url` attribute are appropriate.

Also applies to: 39-39, 41-48
108-128: LGTM! Method now has clear, focused responsibility. The `model_type` parameter flagged in the previous review has been removed, and the method now cleanly loads the generic models from the specified zip file. This is a good simplification.
Actionable comments posted: 0
🧹 Nitpick comments (1)
quantmsrescore/model_downloader.py (1)
270-272: Consider optimizing download-only flow. `MS2ModelManager.__init__` downloads models and also initializes all model instances (MS2, RT, CCS, charge) into memory, which are immediately discarded since the manager object isn't stored. For a download-only function, this initialization overhead is unnecessary.
If MS2ModelManager exposed a static download method or a lighter initialization path that only downloads without loading models, it would be more efficient for this use case.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
quantmsrescore/model_downloader.py
🧰 Additional context used
🧬 Code graph analysis (1)
quantmsrescore/model_downloader.py (1)
quantmsrescore/ms2_model_manager.py (1)
MS2ModelManager(22-253)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
🔇 Additional comments (2)
quantmsrescore/model_downloader.py (2)
16-16: LGTM! Import added for refactored download flow. The MS2ModelManager import is correctly added to support the delegated download logic.
270-272: Previous issue resolved correctly. The None handling has been properly fixed. Line 271 correctly converts `model_dir` (Optional[Path]) to a string, defaulting to "." when None, before passing to MS2ModelManager. This addresses the previous TypeError concern.
Co-authored-by: qodo-code-review[bot] <151058649+qodo-code-review[bot]@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (4)
quantmsrescore/ms2_model_manager.py (3)
21-52: Duplicate threading configuration logic. `_configure_torch_for_hpc` duplicates the functionality of `configure_torch_threads` in `quantmsrescore/__init__.py`. Both set `torch.set_num_threads` and `torch.set_num_interop_threads`. Additionally, calling this at module import time (line 52) happens after `__init__.py` has already configured threading.

🔎 Proposed consolidation

Consider removing this local function and relying on the centralized threading configuration:

```diff
-def _configure_torch_for_hpc(n_threads: int = 1) -> None:
-    """
-    Configure PyTorch thread settings for HPC environments.
-    ...
-    """
-    try:
-        # Limit intra-op parallelism (within single operations)
-        torch.set_num_threads(n_threads)
-        # Limit inter-op parallelism (between independent operations)
-        torch.set_num_interop_threads(n_threads)
-    except RuntimeError:
-        # Threads already configured (can only be set once per process)
-        pass
-
-
-# Apply PyTorch thread limits immediately
-_configure_torch_for_hpc(n_threads=1)
+# Threading is configured by quantmsrescore.__init__.configure_threading()
```

If you need explicit control here, import and call `configure_torch_threads` from `quantmsrescore`.
100-106: Short timeout for model downloads. The 10-second timeout on line 104 may be insufficient for downloading large model files, especially on slower network connections. This could cause unexpected failures.

🔎 Proposed fix

```diff
-        requests = urllib.request.urlopen(url, context=context, timeout=10)  # nosec B310
+        requests = urllib.request.urlopen(url, context=context, timeout=300)  # nosec B310
```

Alternatively, consider using a streaming download approach with progress indication for better UX with large files.
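The streaming alternative mentioned above could look like the sketch below (helper name hypothetical): read the response in fixed-size chunks instead of one `.read()`, so a large model zip never sits fully in memory and progress can be reported per chunk. Note that the `timeout` argument to `urlopen` applies to individual socket operations, so chunked reads also avoid a single long read racing the timeout.

```python
# Hypothetical streaming-download helper: copies any binary file-like object
# (e.g. the result of urllib.request.urlopen) to a destination in chunks.
import io


def stream_to_file(response, dest, chunk_size=1024 * 1024, progress=None):
    """Copy `response` to `dest` in chunks; call progress(bytes_so_far) if given."""
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # EOF
            break
        dest.write(chunk)
        total += len(chunk)
        if progress is not None:
            progress(total)
    return total


# Usage with an in-memory stand-in for the HTTP response:
src = io.BytesIO(b"x" * (3 * 1024 * 1024 + 17))
dst = io.BytesIO()
copied = stream_to_file(src, dst)
assert copied == dst.getbuffer().nbytes
```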
332-336: Add `stacklevel` to `warnings.warn`. The warning lacks a `stacklevel` argument, which means the warning will point to this line rather than the caller's location.

🔎 Proposed fix

```diff
 if mask_modloss is not None:
     warnings.warn(
         "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
-        "the charged_frag_types should not include the modloss fragments."
+        "the charged_frag_types should not include the modloss fragments.",
+        stacklevel=2,
     )
```

quantmsrescore/__init__.py (1)
82-139: Consider `threadpoolctl` for BLAS/OpenMP control, though it's complementary rather than a replacement. `threadpoolctl` provides a unified interface for BLAS libraries (MKL, OpenBLAS, BLIS) and OpenMP runtimes, with support for temporary/context-manager-based control. However, it doesn't cover TensorFlow-specific settings (`TF_NUM_INTEROP_THREADS`, `TF_NUM_INTRAOP_THREADS`) or NumExpr, so the current environment variable approach is necessary for comprehensive library coverage in this context.
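The "environment variable approach" referred to here can be sketched with pure stdlib code (function name hypothetical; the real implementation is `configure_threading` in `quantmsrescore/__init__.py`). The crucial constraint is ordering: these variables are read once by the OpenMP/BLAS runtimes when numpy/torch/tensorflow are first imported, so they must be set before those imports.

```python
# Sketch of env-var-based thread capping; must run before heavy imports,
# because the underlying runtimes read these variables at load time.
import os

_THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",         # OpenMP (also honored by MKL and PyTorch)
    "OPENBLAS_NUM_THREADS",    # OpenBLAS
    "MKL_NUM_THREADS",         # Intel MKL
    "NUMEXPR_NUM_THREADS",     # NumExpr
    "TF_NUM_INTRAOP_THREADS",  # TensorFlow intra-op pool
    "TF_NUM_INTEROP_THREADS",  # TensorFlow inter-op pool
)


def cap_threads(n: int = 1) -> None:
    """Cap per-process thread pools; setdefault keeps any user-provided values."""
    for var in _THREAD_ENV_VARS:
        os.environ.setdefault(var, str(n))


cap_threads(1)
```

Using `setdefault` rather than outright assignment lets HPC users override the cap from their job scripts, which is usually the desired behavior.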
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- environment.yml
- pyproject.toml
- quantmsrescore/__init__.py
- quantmsrescore/idxmlreader.py
- quantmsrescore/ms2_model_manager.py
- quantmsrescore/ms2rescore.py
- quantmsrescore/transfer_learning.py
- requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsrescore/transfer_learning.py
🧰 Additional context used
🧬 Code graph analysis (2)
quantmsrescore/idxmlreader.py (1)
quantmsrescore/openms.py (1)
is_decoy_peptide_hit(131-149)
quantmsrescore/ms2rescore.py (1)
quantmsrescore/__init__.py (2)
`configure_threading` (82-139), `configure_torch_threads` (142-165)
🪛 Ruff (0.14.10)
quantmsrescore/__init__.py
10-10: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
219-225: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
quantmsrescore/ms2_model_manager.py
93-93: Avoid specifying long messages outside the exception class
(TRY003)
97-97: Avoid specifying long messages outside the exception class
(TRY003)
104-104: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
114-118: Avoid specifying long messages outside the exception class
(TRY003)
315-315: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
333-333: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
342-342: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
pyproject.toml (1)
45-45: LGTM! The `psutil` dependency addition is appropriate for the memory-aware process counting functionality. The unpinned version is consistent with other dependencies in this file.

requirements.txt (1)
12-13: LGTM! The `psutil` dependency is added consistently with `pyproject.toml` and `environment.yml`.

quantmsrescore/idxmlreader.py (1)
225-235: Good performance optimization. The switch from `DataFrame.append()` in a loop to collecting records in a list and constructing the DataFrame in a single operation is a significant improvement (O(n) vs O(n²)). The docstring clearly documents the rationale.

quantmsrescore/ms2_model_manager.py (1)
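The O(n) pattern being praised can be shown in isolation (column names are illustrative, not the reader's actual schema): accumulate plain dicts, then build the DataFrame once at the end.

```python
# Accumulate records as dicts, then construct the DataFrame in one call.
import pandas as pd

psm_records = []
for i in range(5):
    psm_records.append({"sequence": f"PEPTIDE{i}", "charge": 2, "score": 0.1 * i})

# One allocation instead of n copies; DataFrame.append() copied the whole
# frame on every call, giving O(n^2) total work (and was removed in pandas 2.0).
psms_df = pd.DataFrame(psm_records)
assert len(psms_df) == 5
```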
141-160: Past issue addressed. The unused `model_type` parameter flagged in previous reviews has been removed, simplifying the API. This is a valid resolution if the functionality isn't needed.

environment.yml (1)
21-21: LGTM! The `psutil` dependency is added consistently across all dependency manifests.

quantmsrescore/ms2rescore.py (2)
7-8: LGTM! Correct import ordering - threading configuration is imported before heavy library imports (FeatureAnnotator).

225-229: Good HPC-aware threading setup. The threading configuration is correctly applied before creating the `FeatureAnnotator`, ensuring that all downstream library imports respect the thread limits. The explanatory comment is helpful.

quantmsrescore/__init__.py (2)
1-22: Well-documented threading control module. The module-level documentation clearly explains the HPC thread explosion problem and the solution. The approach of setting environment variables before heavy library imports is the correct pattern.
168-198: Good fallback handling for `psutil`. The `get_safe_process_count` function gracefully handles the case where `psutil` is not available by falling back to CPU count. This maintains compatibility while providing better resource management when possible.
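The fallback pattern described can be sketched as follows. This is not the actual `get_safe_process_count` from `quantmsrescore/__init__.py`; the memory-per-process figure and the min/max heuristic are illustrative assumptions.

```python
# Illustrative fallback pattern: bound process count by CPUs, and by free RAM
# when psutil is importable; degrade gracefully when it is not.
import os


def safe_process_count(mem_per_proc_gb: float = 4.0) -> int:
    """Return a process count bounded by CPU count and (optionally) available memory."""
    n_cpu = os.cpu_count() or 1
    try:
        import psutil  # optional dependency added by this PR
    except ImportError:
        return n_cpu  # graceful fallback: CPU count only
    avail_gb = psutil.virtual_memory().available / (1024 ** 3)
    return max(1, min(n_cpu, int(avail_gb // mem_per_proc_gb)))
```

The `try/except ImportError` inside the function keeps `psutil` a soft dependency: environments without it still get a sane answer, just without the memory bound.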
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
quantmsrescore/openms.py (1)
674-740: Clarify memory efficiency claims when `use_cache=False`. The docstring for `iter_mslevel_spectra` states it's "more memory-efficient than get_mslevel_spectra()", but this is only true when `use_cache=True`. When `use_cache=False`, both methods call `get_spectrum_lookup_indexer`, which loads the entire `MSExperiment` into memory.

Consider updating the docstring to clarify:

- "This is more memory-efficient than get_mslevel_spectra() when you're iterating once through cached data..."
- Or note that even with `use_cache=False`, the underlying `MSExperiment` is fully loaded

This sets correct expectations for users trying to minimize memory usage.

📝 Suggested docstring improvement

```diff
 def iter_mslevel_spectra(
     file_name: Union[str, Path], ms_level: int, use_cache: bool = True
 ) -> Generator[oms.MSSpectrum, None, None]:
     """
     Iterate over spectra of a specific MS level (memory-efficient generator).

-    This is more memory-efficient than get_mslevel_spectra() when you don't
-    need all spectra at once.
+    This is more memory-efficient than get_mslevel_spectra() when you need
+    to iterate through spectra without storing them all in memory at once.
+    Note: The underlying MSExperiment is still fully loaded; this avoids
+    creating an intermediate list.
```
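The list-vs-generator distinction the comment draws can be shown generically (the dict-based "spectra" below are stand-ins, not the pyOpenMS types): the generator yields one match at a time and never builds the intermediate list, even though the underlying data source may still be fully loaded.

```python
# Generic illustration: list version materializes all matches; generator yields
# them one at a time without an intermediate list.
from typing import Generator, List

SPECTRA = [{"ms_level": lvl, "id": i} for i, lvl in enumerate([1, 2, 2, 1, 2])]


def get_mslevel_spectra_list(ms_level: int) -> List[dict]:
    # Materializes every matching spectrum at once.
    return [s for s in SPECTRA if s["ms_level"] == ms_level]


def iter_mslevel_spectra_gen(ms_level: int) -> Generator[dict, None, None]:
    # Yields one spectrum at a time; no intermediate list is built.
    for s in SPECTRA:
        if s["ms_level"] == ms_level:
            yield s


assert [s["id"] for s in iter_mslevel_spectra_gen(2)] == [1, 2, 4]
```

This is exactly why the docstring fix matters: the generator saves the intermediate list, but it cannot save memory spent loading `SPECTRA` (here, the `MSExperiment`) in the first place.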
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- quantmsrescore/alphapeptdeep.py
- quantmsrescore/annotator.py
- quantmsrescore/ms2pip.py
- quantmsrescore/openms.py
🧰 Additional context used
🧬 Code graph analysis (3)
quantmsrescore/annotator.py (3)
- quantmsrescore/idxmlreader.py (2): `psms` (90-92), `psms` (95-99)
- quantmsrescore/ms2pip.py (1): `add_features` (298-340)
- quantmsrescore/openms.py (2): `OpenMSHelper` (103-751), `clear_spectrum_cache` (83-100)

quantmsrescore/alphapeptdeep.py (1)
- quantmsrescore/ms2pip.py (1): `read_spectrum_file` (556-615)

quantmsrescore/ms2pip.py (2)
- quantmsrescore/alphapeptdeep.py (1): `read_spectrum_file` (972-1031)
- quantmsrescore/openms.py (2): `OpenMSHelper` (103-751), `iter_mslevel_spectra` (708-740)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
quantmsrescore/alphapeptdeep.py (1)
972-1031: LGTM! Memory-efficient iterator with caching support. The change from `get_mslevel_spectra` to `iter_mslevel_spectra` improves memory efficiency by using a generator instead of loading all spectra into memory at once. The addition of the `use_cache` parameter (defaulting to `True`) enables sharing spectrum data across MS2PIP and AlphaPeptDeep feature generators, reducing redundant file I/O.
quantmsrescore/ms2pip.py (1)
556-615: LGTM! Consistent caching pattern across feature generators. This change mirrors the identical improvement in `quantmsrescore/alphapeptdeep.py`, ensuring both MS2PIP and AlphaPeptDeep benefit from the same caching and memory-efficiency improvements. The consistency across feature generators is good for maintainability.

quantmsrescore/annotator.py (2)
282-284: Good practice: explicit memory cleanup after annotation. Calling `clear_spectrum_cache()` and `gc.collect()` after annotation completes is good for releasing resources promptly. This aligns with the broader memory optimization effort in this PR and ensures that cached spectrum data doesn't persist unnecessarily.
702-726: Memory-efficient model comparison, contingent on shallow copy safety. The replacement of `copy.deepcopy()` with `_shallow_copy_psm_list()` is excellent for memory efficiency, and the explicit cleanup (`del` + `gc.collect()`) after each model evaluation ensures memory is promptly released.

However, the safety of this change depends on the shallow copy implementation being correct (see comment on lines 20-63). If shared mutable attributes are modified during `add_features()`, this could cause issues.
Actionable comments posted: 0
🧹 Nitpick comments (1)
quantmsrescore/idxmlreader.py (1)
307-308: Consider specifying columns for empty DataFrame consistency. When `psm_records` is empty, the resulting DataFrame has no columns. While this is likely a rare edge case, specifying the expected column names ensures schema consistency.

🔎 Optional improvement to maintain schema

```diff
-        self._psms_df = pd.DataFrame(psm_records) if psm_records else pd.DataFrame()
+        self._psms_df = pd.DataFrame(psm_records) if psm_records else pd.DataFrame(
+            columns=["sequence", "charge", "mods", "mod_sites", "nce",
+                     "provenance_data", "instrument", "spectrum_ref", "filename",
+                     "is_decoy", "rank", "score"]
+        )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
quantmsrescore/idxmlreader.py
🧰 Additional context used
🧬 Code graph analysis (1)
quantmsrescore/idxmlreader.py (1)
quantmsrescore/openms.py (1)
is_decoy_peptide_hit(193-211)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
- GitHub Check: build (3.11)
🔇 Additional comments (3)
quantmsrescore/idxmlreader.py (3)
225-231: Excellent documentation of the performance rationale. The added notes clearly explain why the list-based approach is preferred over DataFrame.append() in a loop, making the design decision transparent to future maintainers.
233-235: Well-structured initialization with clear performance comments. The list-based approach is the correct pattern for building DataFrames efficiently.
290-304: Excellent performance improvement and bug fix. The refactor from DataFrame append to list append delivers the promised O(n) performance, and line 300 now correctly handles the `None` filename case (addressing the previous TypeError concern).
Actionable comments posted: 3
♻️ Duplicate comments (1)
quantmsrescore/annotator.py (1)
20-63: Fix default values to preserve None semantics. Lines 55, 58, and 59 incorrectly use empty containers as fallback values instead of preserving `None`:

- Line 55: `protein_list=psm.protein_list.copy() if psm.protein_list else []` → should be `else None`
- Line 58: `provenance_data=psm.provenance_data.copy() if psm.provenance_data else {}` → should be `else None`
- Line 59: `metadata=psm.metadata.copy() if psm.metadata else {}` → should be `else None`

Using empty containers instead of `None` changes the semantics and may break code that checks for `None` explicitly.

🔎 Proposed fix

```diff
-            protein_list=psm.protein_list.copy() if psm.protein_list else [],
+            protein_list=psm.protein_list.copy() if psm.protein_list else None,
             rank=psm.rank,
             source=psm.source,
-            provenance_data=psm.provenance_data.copy() if psm.provenance_data else {},  # Can share as keys are read-only
-            metadata=psm.metadata.copy() if psm.metadata else {},
+            provenance_data=psm.provenance_data.copy() if psm.provenance_data else None,
+            metadata=psm.metadata.copy() if psm.metadata else None,
             rescoring_features={},  # Fresh dict - this is what will be modified
```
🧹 Nitpick comments (4)
quantmsrescore/openms.py (1)
176-211: Optimize the enumeration path in `organize_psms_by_spectrum_id`. Line 206 uses `enumerated_psm_list.index(item)`, which is O(n) and inefficient when repeatedly called for non-tuple items. If the input is a raw PSM list, consider enumerating it once upfront:

```python
def organize_psms_by_spectrum_id(
    enumerated_psm_list: List[Any]
) -> Dict[str, List[Tuple[int, Any]]]:
    from collections import defaultdict

    psms_by_specid = defaultdict(list)

    # Check first item to determine format
    if enumerated_psm_list and not (
        isinstance(enumerated_psm_list[0], tuple) and len(enumerated_psm_list[0]) == 2
    ):
        # Enumerate once if not already enumerated
        enumerated_psm_list = list(enumerate(enumerated_psm_list))

    for psm_index, psm in enumerated_psm_list:
        psms_by_specid[str(psm.spectrum_id)].append((psm_index, psm))
    return psms_by_specid
```

This avoids O(n²) behavior when processing raw PSM lists.
quantmsrescore/ms2_model_manager.py (3)
108-108: Handle edge case where model path has no directory component. If `model_zip_file_path` is just a filename (e.g., `"model.zip"`), `os.path.dirname()` returns an empty string, and `os.makedirs("")` may behave unexpectedly on some systems.

🔎 Proposed defensive fix

```diff
-        os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
+        model_dir = os.path.dirname(model_zip_file_path)
+        if model_dir:
+            os.makedirs(model_dir, exist_ok=True)
```
356-356: Verify whether `**kwargs` is needed for interface compatibility. The `**kwargs` parameter is unused in the method body. If this method overrides a parent class method or implements an interface that requires `**kwargs`, the parameter is justified. Otherwise, it can be removed.
347-347: Add `stacklevel=2` to improve warning clarity. The warning should point to the caller's location, not the function definition. Adding `stacklevel=2` improves the developer experience when diagnosing deprecation warnings.

🔎 Proposed fix

```diff
 warnings.warn(
     "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
     "the charged_frag_types should not include the modloss fragments.",
+    stacklevel=2,
 )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- quantmsrescore/alphapeptdeep.py
- quantmsrescore/annotator.py
- quantmsrescore/ms2_model_manager.py
- quantmsrescore/ms2pip.py
- quantmsrescore/openms.py
- quantmsrescore/transfer_learning.py
🧰 Additional context used
🧬 Code graph analysis (3)
quantmsrescore/annotator.py (1)
quantmsrescore/openms.py (1)
clear_spectrum_cache(143-166)
quantmsrescore/alphapeptdeep.py (2)
quantmsrescore/openms.py (5)
OpenMSHelper(237-885)get_compiled_regex(53-82)organize_psms_by_spectrum_id(176-211)calculate_correlations(214-234)iter_mslevel_spectra(842-874)quantmsrescore/ms2pip.py (1)
read_spectrum_file(549-608)
quantmsrescore/ms2pip.py (2)
quantmsrescore/openms.py (5)
- `OpenMSHelper` (237-885)
- `get_compiled_regex` (53-82)
- `organize_psms_by_spectrum_id` (176-211)
- `calculate_correlations` (214-234)
- `iter_mslevel_spectra` (842-874)

quantmsrescore/alphapeptdeep.py (1)

- `read_spectrum_file` (960-1019)
🪛 Ruff (0.14.10)
quantmsrescore/ms2_model_manager.py
99-99: Avoid specifying long messages outside the exception class
(TRY003)
103-103: Avoid specifying long messages outside the exception class
(TRY003)
112-112: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
128-132: Avoid specifying long messages outside the exception class
(TRY003)
329-329: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
347-347: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
356-356: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
🔇 Additional comments (7)
quantmsrescore/ms2pip.py (1)
30-35: LGTM! Caching integration is well-implemented.

The changes successfully integrate the new caching infrastructure:

- Import of shared utilities (`get_compiled_regex`, `organize_psms_by_spectrum_id`, `calculate_correlations`) eliminates code duplication
- `use_cache` parameter enables memory-efficient spectrum loading with proper documentation
- Consistent usage of cached utilities throughout the processing pipeline
Also applies to: 549-608, 812-816
quantmsrescore/openms.py (1)
40-82: LGTM! Caching infrastructure is well-designed.

The new caching infrastructure provides:
- Bounded LRU cache (MAX_CACHE_SIZE=3) preventing unbounded memory growth
- Efficient regex compilation caching with fallback handling
- Clear separation of concerns between get/iter spectrum methods
- Explicit cache clearing for memory management
Also applies to: 85-141, 143-167, 214-235, 808-874
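The bounded-LRU behavior described above can be sketched as follows; the class and its methods are illustrative assumptions, not the actual `OpenMSHelper` implementation:

```python
from collections import OrderedDict

MAX_CACHE_SIZE = 3  # matches the bound mentioned in the review

class SpectrumCache:
    """Illustrative bounded LRU cache; evicts the least recently used entry."""

    def __init__(self, max_size=MAX_CACHE_SIZE):
        self._cache = OrderedDict()
        self._max_size = max_size

    def get(self, path, loader):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
            return self._cache[path]
        value = loader(path)
        self._cache[path] = value
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the oldest entry
        return value

    def clear(self):
        self._cache.clear()
```

Keeping the bound small is what prevents unbounded growth when many mzML files are touched in one run; `clear()` mirrors the explicit `clear_spectrum_cache()` hook.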
quantmsrescore/alphapeptdeep.py (1)
10-15: LGTM! Consistent integration with caching infrastructure.

The changes mirror the ms2pip.py implementation:
- Shared utilities properly imported and used
- `use_cache` parameter consistently applied
- Regex compilation and PSM organization follow the same efficient pattern
Also applies to: 831-835, 902-907, 960-1019
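A minimal sketch of this shared-regex pattern, assuming `functools.lru_cache` semantics (the real `get_compiled_regex` may differ, e.g., in its fallback handling):

```python
import re
from functools import lru_cache

@lru_cache(maxsize=128)
def get_compiled_regex(pattern: str):
    """Compile each pattern once; later calls with the same pattern
    return the cached compiled object instead of recompiling."""
    return re.compile(pattern)
```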
quantmsrescore/annotator.py (1)
282-284: LGTM! Memory management properly implemented.

The shallow copy approach combined with explicit cache clearing is effective:

- `_shallow_copy_psm_list` avoids deep copy overhead while creating fresh `rescoring_features` dicts
- `clear_spectrum_cache()` and `gc.collect()` calls prevent memory leaks
- Strategic cleanup between model evaluations reduces peak memory usage
Also applies to: 702-726
quantmsrescore/transfer_learning.py (2)
167-170: Verify threading configuration aligns with user expectations.

Lines 167-170 hardcode `n_threads=1` regardless of the `processes` parameter value. While the help text explains "Each process uses 1 internal thread to avoid HPC resource contention," this design may be surprising:

- User passes `--processes=4` expecting 4-way parallelism
- Code forces internal threading to 1, which is correct for avoiding oversubscription in multiprocessing
- However, the relationship between `processes` (external parallelism) and internal threading (forced to 1) should be clearer

Consider:

- Clarifying in the help text that `processes` controls the multiprocessing pool size, not internal threads
- Adding a code comment explaining why threading is forced to 1 in HPC contexts
- Verifying that peptdeep/torch models respect these thread limits
5-6: LGTM! Logging and caching integration is correct.

The changes properly integrate logging and caching utilities:
- Log level flows from CLI → trainer → configure_logging
- `get_compiled_regex` replaces inline regex compilation
- Threading configuration documented in help text
Also applies to: 9-9, 100-100, 184-184, 201-201, 219-221, 330-331
quantmsrescore/ms2_model_manager.py (1)
155-175: Verify whether `model_type` support was intentionally removed.

Previous review comments suggested adding a `model_type` parameter to support loading different model variants (e.g., 'phospho', 'HLA', 'digly'), but the current implementation hardcodes `"generic/"` paths for all models. The past suggestion was marked as "manually_applied" (lines 168-177), yet the parameter is absent.

Was the `model_type` feature intentionally removed to simplify the API, or should it still be implemented per the previous review?
```python
        if os.path.exists(model_zip_file_path):
            if not overwrite:
                raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
            # File exists and overwrite is True, skip download
        else:
```
Clarify or fix the `overwrite` parameter behavior.
The current logic skips the download when the file exists and `overwrite=True`, rather than actually overwriting the file. This is misleading:

- Lines 102-103: If file exists and `overwrite=False`, raise an error ✓
- Line 104: Comment says "File exists and overwrite is True, skip download"
- Expected: If `overwrite=True`, the file should be re-downloaded/overwritten
Consider either:
- Rename the parameter to `skip_if_exists` or `allow_existing` for clarity, or
- Implement actual overwrite behavior by removing the existing file before download
🔎 Proposed fix to implement actual overwrite behavior
```diff
         if os.path.exists(model_zip_file_path):
             if not overwrite:
                 raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-            # File exists and overwrite is True, skip download
+            # File exists and overwrite is True, remove it to re-download
+            os.remove(model_zip_file_path)
+
+        logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
+        try:
+            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
-        else:
-            logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
-            try:
-                os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.10)
103-103: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In quantmsrescore/ms2_model_manager.py around lines 101 to 105, the current
logic treats overwrite=True as "skip download" which is misleading; change
behavior so that if os.path.exists(model_zip_file_path) and overwrite is True,
delete the existing file (os.remove or Path.unlink) before proceeding to
download and writing the new file, and if overwrite is False keep the existing
raise FileExistsError path; ensure deletion errors are handled or propagated
with a clear error message and that any parent directories and permissions are
respected before attempting the download.
|
Ok. Can you check the last three Major Code Rabbit comments? They make sense. |
|
I'm on it.
Actionable comments posted: 2
🧹 Nitpick comments (3)
quantmsrescore/ms2_model_manager.py (1)
362-366: Add `stacklevel=2` to `warnings.warn` for accurate caller location.

Without an explicit `stacklevel`, the warning will point to this line rather than the caller's code that passed the deprecated `mask_modloss` argument.

🔎 Proposed fix

```diff
         if mask_modloss is not None:
             warnings.warn(
                 "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
-                "the charged_frag_types should not include the modloss fragments."
+                "the charged_frag_types should not include the modloss fragments.",
+                stacklevel=2,
             )
```

quantmsrescore/__init__.py (2)
82-140: Consider using `threadpoolctl` for cleaner thread management.

The current manual environment variable approach works, but `threadpoolctl` provides a more maintainable solution. It's recommended by NumPy and would simplify this function significantly.

Based on past discussion in this PR, the team has already identified threadpoolctl as a preferable approach. Consider migrating to it in a follow-up.

Example using threadpoolctl

```python
def configure_threading(n_threads: Optional[int] = None, verbose: bool = False) -> None:
    """Configure thread counts using threadpoolctl."""
    if n_threads is None:
        n_threads = _DEFAULT_THREADS_PER_PROCESS

    try:
        from threadpoolctl import threadpool_limits

        # This context manager can be used globally or per-operation
        threadpool_limits(limits=n_threads, user_api='blas')
        threadpool_limits(limits=n_threads, user_api='openmp')
    except ImportError:
        # Fallback to current env var approach
        pass

    # TensorFlow specific settings (not covered by threadpoolctl)
    os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")
```

Note: threadpoolctl needs to be added to dependencies. See https://pypi.org/project/threadpoolctl/
228-234: Optional: Consider sorting `__all__` alphabetically.

The current grouping is logical (threading functions, then helper functions, then metadata), but alphabetical sorting can improve consistency.

Alphabetically sorted version

```diff
 __all__ = [
+    "__version__",
+    "calculate_optimal_parallelism",
     "configure_threading",
     "configure_torch_threads",
-    "calculate_optimal_parallelism",
     "get_safe_process_count",
-    "__version__",
 ]
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `quantmsrescore/__init__.py`
- `quantmsrescore/ms2_model_manager.py`
🧰 Additional context used
🪛 Ruff (0.14.10)
quantmsrescore/__init__.py
10-10: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
228-234: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
quantmsrescore/ms2_model_manager.py
110-110: Avoid specifying long messages outside the exception class
(TRY003)
114-114: Avoid specifying long messages outside the exception class
(TRY003)
123-123: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
139-143: Avoid specifying long messages outside the exception class
(TRY003)
363-363: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
372-372: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: build (3.11)
- GitHub Check: build (3.11)
- GitHub Check: Build and Push QuantMS-Rescoring Docker Images
🔇 Additional comments (12)
quantmsrescore/ms2_model_manager.py (8)
1-20: LGTM!

The new imports (`urllib`, `ssl`, `certifi`, `shutil`) are appropriate for implementing secure streaming downloads with proper SSL certificate handling.
22-56: LGTM!

The opt-in approach via the `QUANTMS_HPC_MODE` environment variable addresses the previous concern about automatic thread configuration at import time. The public `configure_torch_for_hpc()` function allows explicit control when needed.
59-87: LGTM!

The initialization logic correctly handles both scenarios: loading external models when present, or downloading and loading the default pretrained models. The default `model_dir="."` is a reasonable choice for the current working directory.
126-143: LGTM!

Good defensive programming: cleaning up partial downloads on failure and validating the downloaded file is a valid zip before proceeding.
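The defensive pattern being praised here can be sketched as follows; `is_model_zip` and the cleanup helper below are illustrative, not the actual module code:

```python
import os
import zipfile

def is_model_zip(path):
    """Return True only if the file exists and is a readable zip archive."""
    return os.path.isfile(path) and zipfile.is_zipfile(path)

def download_with_cleanup(fetch, dest):
    """Illustrative pattern: remove a partial or corrupt file when the
    download or zip validation fails, so a later retry starts clean."""
    try:
        fetch(dest)
        if not is_model_zip(dest):
            raise ValueError(f"Downloaded file is not a valid zip: {dest}")
    except Exception:
        if os.path.exists(dest):
            os.remove(dest)  # never leave a partial download behind
        raise
```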
166-185: LGTM!

The method correctly loads all model types (MS2, RT, CCS, Charge) from the downloaded zip file. The previous concern about the unused `model_type` parameter has been addressed by removing it.
338-351: LGTM!

The fix for the default argument issue (Ruff B008) is correct. Computing `charged_frag_types` inside the method when `None` is passed ensures it's evaluated at call time rather than definition time.
368-394: LGTM!

The `**kwargs` parameter is retained for interface compatibility with the parent class. The batch prediction logic correctly handles normalization, masking, and both ordered and sliced updates.
420-464: LGTM!

The function correctly handles updating DataFrame slices with appropriate optimizations: fast numpy slicing when all columns are updated, and `iloc` with specific column indices when only certain fragment types need updating.

quantmsrescore/__init__.py (4)
24-79: LGTM!

The resource calculation logic correctly balances CPU and memory constraints for HPC environments. The conservative default of 1 thread per process prevents the thread explosion issue described in the header comments.
142-165: LGTM!

PyTorch thread configuration is implemented correctly with proper error handling for both missing PyTorch installation and already-configured threads.
168-198: LGTM!

Resource-aware process calculation with appropriate fallback when psutil is unavailable. The conservative approach of returning at least 1 process ensures the function never blocks execution.
201-216: LGTM!

The opt-in automatic configuration via `QUANTMS_HPC_MODE` is well-designed. The comments clearly explain both the automatic and explicit control approaches, and the CLI commands handle threading explicitly regardless of this setting.
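A sketch of such a psutil-aware calculation with an import fallback (the function name matches the review, but the body and the 2 GB-per-process default are assumptions for illustration):

```python
import os

def get_safe_process_count(requested, mem_per_process_gb=2.0):
    """Cap the requested process count by available CPUs and, when psutil
    is importable, by estimated free memory; never return less than 1."""
    cpu_limit = os.cpu_count() or 1
    try:
        import psutil
        available_gb = psutil.virtual_memory().available / (1024 ** 3)
        mem_limit = max(1, int(available_gb // mem_per_process_gb))
    except ImportError:
        mem_limit = cpu_limit  # conservative fallback without psutil
    return max(1, min(requested, cpu_limit, mem_limit))
```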
```python
        logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
        try:
            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
            context = ssl.create_default_context(cafile=certifi.where())
            # Use streaming download with longer timeout for large model files
            # timeout=300s (5 min) for slow connections; stream in 1MB chunks
            with urllib.request.urlopen(url, context=context, timeout=300) as response:  # nosec B310
                with open(model_zip_file_path, "wb") as out_file:
                    shutil.copyfileobj(response, out_file, length=1024 * 1024)  # 1MB chunks
```
Guard against empty dirname before calling `os.makedirs`.

When `model_zip_file_path` is a simple filename without a directory component (e.g., `"pretrained_models_v3.zip"`), `os.path.dirname()` returns an empty string. Calling `os.makedirs("", exist_ok=True)` can raise `FileNotFoundError` on some platforms.
🔎 Proposed fix
```diff
         logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
         try:
-            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
+            parent_dir = os.path.dirname(model_zip_file_path)
+            if parent_dir:
+                os.makedirs(parent_dir, exist_ok=True)
             context = ssl.create_default_context(cafile=certifi.where())
```
🪛 Ruff (0.14.10)
123-123: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
In quantmsrescore/ms2_model_manager.py around lines 117 to 125, calling
os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True) can fail when
model_zip_file_path has no directory component (dirname == ""), so guard against
an empty dirname: compute dirpath = os.path.dirname(model_zip_file_path) and
only call os.makedirs(dirpath, exist_ok=True) if dirpath is non-empty (or
alternatively replace empty dirname with "." to mean current directory); keep
the rest of the download logic unchanged.
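Putting the streaming download and the dirname guard together, a self-contained sketch might look like this (illustrative only; the real `_download_models` differs, and the optional `certifi` import is an assumption):

```python
import os
import shutil
import ssl
import urllib.request

try:
    import certifi  # third-party CA bundle, as used in the PR
    _CAFILE = certifi.where()
except ImportError:
    _CAFILE = None  # fall back to the system CA store

def download_file(url: str, dest: str, timeout: int = 300, chunk: int = 1024 * 1024) -> None:
    """Stream a remote file to dest in fixed-size chunks."""
    parent = os.path.dirname(dest)
    if parent:  # dirname is "" for a bare filename; guard before makedirs
        os.makedirs(parent, exist_ok=True)
    context = ssl.create_default_context(cafile=_CAFILE)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:  # nosec B310
        with open(dest, "wb") as out:
            shutil.copyfileobj(resp, out, length=chunk)
```

Streaming via `shutil.copyfileobj` keeps memory flat regardless of archive size, which matters for multi-gigabyte model zips.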
PR Type
Enhancement, Bug fix
Description
Refactor model downloading logic with custom implementation
  - `_download_models` with custom `_download_models` method
- Simplify AlphaPeptDeep model download in
`model_downloader.py`
  - `MS2ModelManager` for unified model handling
- Add logging level configuration to transfer learning CLI
  - `--log_level` parameter for CLI command
  - `AlphaPeptdeepTrainer` initialization
- Add new MS2PIP feature mapping for cosine similarity
Code formatting and style improvements
Diagram Walkthrough
File Walkthrough
constants.py
Add cosine similarity feature mapping (`quantmsrescore/constants.py`)
model_downloader.py
Simplify model download delegation (`quantmsrescore/model_downloader.py`)
- `MS2ModelManager` for unified model handling
- `download_alphapeptdeep_models` function to delegate to `MS2ModelManager._download_models`

ms2_model_manager.py
Implement custom model download with validation (`quantmsrescore/ms2_model_manager.py`)
- `_download_models` method with SSL/certificate support
- `is_model_zip`
- `model_dir` parameter from `None` to `"."`
- `load_installed_models` method with configurable model path
- `urllib`, `ssl`, and `certifi` for secure downloads
- `MODEL_DOWNLOAD_INSTRUCTIONS` and `is_model_zip`

transfer_learning.py
Add logging level configuration to CLI (`quantmsrescore/transfer_learning.py`)
- `--log_level` CLI option with default value "info"
- `log_level` parameter to `AlphaPeptdeepTrainer` initialization
- `AlphaPeptdeepTrainer.__init__` using `configure_logging` parameter
Dockerfile
Update Docker working directory (`Dockerfile`)
- `/app` to `/work`

Summary by CodeRabbit
New Features
Improvements
Chores
✏️ Tip: You can customize this high-level summary in your review settings.