
major changes in quantms-rescoring#59

Merged
ypriverol merged 39 commits into main from dev on Jan 4, 2026
Conversation

@ypriverol
Member

@ypriverol ypriverol commented Jan 2, 2026

PR Type

Enhancement, Bug fix


Description

  • Refactor model downloading logic with custom implementation

    • Replace peptdeep's _download_models with custom _download_models method
    • Add SSL/certificate handling for robust downloads
    • Implement model zip validation
  • Simplify AlphaPeptDeep model download in model_downloader.py

    • Delegate to MS2ModelManager for unified model handling
    • Remove redundant file copying logic
  • Add logging level configuration to transfer learning CLI

    • New --log_level parameter for CLI command
    • Configure logging in AlphaPeptdeepTrainer initialization
  • Add new MS2PIP feature mapping for cosine similarity

  • Code formatting and style improvements


Diagram Walkthrough

flowchart LR
  A["model_downloader.py"] -->|delegates to| B["MS2ModelManager"]
  B -->|custom download| C["_download_models method"]
  C -->|SSL/certifi| D["urllib download"]
  D -->|validates| E["is_model_zip check"]
  F["transfer_learning CLI"] -->|new param| G["log_level"]
  G -->|configure| H["logging_config"]

File Walkthrough

Relevant files
Enhancement
constants.py
Add cosine similarity feature mapping                                       

quantmsrescore/constants.py

  • Add new MS2PIP feature mapping for cosine similarity metric
  • Map "MS2PIP:Cos" to "cos" in feature dictionary
+1/-0     
model_downloader.py
Simplify model download delegation                                             

quantmsrescore/model_downloader.py

  • Import MS2ModelManager for unified model handling
  • Simplify download_alphapeptdeep_models function to delegate to
    MS2ModelManager
  • Remove manual model copying and file handling logic
  • Remove direct imports from peptdeep's _download_models
+2/-35   
ms2_model_manager.py
Implement custom model download with validation                   

quantmsrescore/ms2_model_manager.py

  • Add custom _download_models method with SSL/certificate support
  • Implement model zip validation using is_model_zip
  • Change default model_dir parameter from None to "."
  • Add load_installed_models method with configurable model path
  • Import urllib, ssl, and certifi for secure downloads
  • Update imports from peptdeep to use MODEL_DOWNLOAD_INSTRUCTIONS and
    is_model_zip
  • Fix code formatting and indentation throughout class
+87/-26 
transfer_learning.py
Add logging level configuration to CLI                                     

quantmsrescore/transfer_learning.py

  • Add --log_level CLI option with default value "info"
  • Pass log_level parameter to AlphaPeptdeepTrainer initialization
  • Configure logging in AlphaPeptdeepTrainer.__init__ using
    configure_logging
  • Update function signatures and docstrings to include log_level
    parameter
+14/-3   
Configuration changes
Dockerfile
Update Docker working directory                                                   

Dockerfile

  • Change working directory from /app to /work
+1/-1     

Summary by CodeRabbit

  • New Features

    • CLI log-level option; global threading controls for HPC; streaming spectrum reader with optional caching and explicit cache-clear.
  • Improvements

    • More robust model download flow with validation and clearer failure guidance; offline model download instructions.
    • Added MS2PIP "Cos" feature mapping.
    • Memory/performance: faster PSM construction, shallower PSM copies, explicit GC/cache cleanup.
  • Chores

    • Container runtime paths adjusted for matplotlib temp files; psutil added to dependencies; README updated with HPC/Nextflow guidance.


@ypriverol ypriverol requested a review from daichengxin January 2, 2026 16:00
@coderabbitai
Contributor

coderabbitai bot commented Jan 2, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Consolidates MS2 model download/load into MS2ModelManager (URL-based download with SSL and zip validation), adds psutil and an MS2PIP feature mapping, and introduces threading, logging, spectrum caching/streaming, memory-saving PSM handling, and Docker runtime ENV/WORKDIR adjustments to /app.

Changes

  • Container (Dockerfile): Runtime WORKDIR and environment updates: final-stage HOME, PEPTDEEP_HOME, and MPLCONFIGDIR switched to /app; working dir and permissions updated to target /app.
  • Model manager & downloader (quantmsrescore/ms2_model_manager.py, quantmsrescore/model_downloader.py): Consolidated model download/load into MS2ModelManager with model_url, _download_models() (URL validation, urllib+certifi streaming, overwrite handling, zip validation), and load_installed_models(); model_downloader.py delegates to the manager.
  • Threading & core utilities (quantmsrescore/__init__.py, quantmsrescore/transfer_learning.py, quantmsrescore/ms2rescore.py): New threading helpers (configure_threading, configure_torch_threads, calculate_optimal_parallelism, get_safe_process_count), automatic/explicit HPC-safe configuration, CLI/trainer log_level propagation, and pre-import thread limits applied in key flows.
  • I/O caching & streaming (quantmsrescore/openms.py, quantmsrescore/idxmlreader.py, quantmsrescore/annotator.py): Added bounded LRU spectrum cache, compiled-regex cache, get_compiled_regex, get_cached_spectrum_data, clear_spectrum_cache, organize_psms_by_spectrum_id, calculate_correlations; OpenMSHelper gains iter_mslevel_spectra and cache-aware get_mslevel_spectra; idxmlreader builds its DataFrame from a list of records; annotator uses shallow PSM copies and explicit GC/cache clearing. A minimal sketch of the cached-regex idea follows this list.
  • Spectrum readers & helpers (quantmsrescore/ms2pip.py, quantmsrescore/alphapeptdeep.py): read_spectrum_file adds use_cache: bool = True; switched to iter_mslevel_spectra streaming; replaced local regex/organization helpers with shared cached helpers.
  • Feature constants (quantmsrescore/constants.py): Added "MS2PIP:Cos": "cos" to MS2PIP_FEATURES.
  • Packaging / deps (environment.yml, pyproject.toml, requirements.txt): Added psutil to environment/pyproject/requirements for memory-aware process calculations.
  • Docs (README.md): New "HPC and Nextflow Integration" section describing threading, Nextflow examples, and offline model download instructions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as Caller / CLI
  participant MD as model_downloader.py
  participant MM as MS2ModelManager
  participant HTTP as Remote (model_url)
  participant FS as Filesystem

  rect rgb(230,245,255)
    CLI->>MD: request download/load(model_dir)
    MD->>MM: instantiate MS2ModelManager(model_dir)
  end

  rect rgb(245,255,230)
    MM->>FS: check existing model files/zip
    alt local models present
      FS-->>MM: return installed models
    else
      MM->>HTTP: GET model_url (SSL via certifi)
      HTTP-->>MM: stream zip bytes
      MM->>FS: write zip to download path
      MM->>MM: validate zip (is_model_zip)
    end
  end

  rect rgb(255,250,230)
    MM->>MM: load_installed_models(from zip or local)
    MM-->>MD: models loaded / ready
    MD-->>CLI: success
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Review effort 4/5

Poem

🐰 I dug a tunnel to fetch a zip,
I hopped through logs and gave it a tip,
Threads kept tidy, caches swept neat,
Models arrive checked, ready to greet,
I nibble bytes, then bounce off my feet.

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 76.19%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title 'major changes in quantms-rescoring' is vague and generic, using non-descriptive language that doesn't convey the specific nature or scope of the changes. Consider a more specific title that captures the main change, such as 'Refactor model downloading and add HPC threading optimizations' or 'Implement custom model download with threading configuration'.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf71f8 and df7fe48.

📒 Files selected for processing (1)
  • Dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)


@qodo-code-review
Contributor

qodo-code-review bot commented Jan 2, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Unverified remote download

Description: The new _download_models implementation downloads and writes a remote ZIP from a network
location (self.model_url) without any integrity verification (e.g., pinned hash/signature)
and without bounding the response size (requests.read()), creating a realistic
supply-chain and denial-of-service risk if the download source or path is interfered with.

ms2_model_manager.py [55-84]

Referred Code
def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
    """Download models if not done yet."""
    url = self.model_url
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")

    if not os.path.exists(model_zip_file_path):
        if not overwrite and os.path.exists(model_zip_file_path):
            raise FileExistsError(f"Model file already exists: {model_zip_file_path}")

        logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
        try:
            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
            context = ssl.create_default_context(cafile=certifi.where())
            requests = urllib.request.urlopen(url, context=context, timeout=10)  # nosec B310
            with open(model_zip_file_path, "wb") as f:
                f.write(requests.read())
        except Exception as e:
            raise FileNotFoundError(
                f"Downloading model failed: {e}.\n" + MODEL_DOWNLOAD_INSTRUCTIONS


 ... (clipped 9 lines)
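To make this finding concrete, here is a hedged sketch of the kind of integrity check it asks for: verifying the downloaded zip against a pinned SHA-256 digest before use. The helper below is illustrative only (it is not part of the PR), and the expected digest would have to be the real, published checksum of pretrained_models_v3.zip.

import hashlib

def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> None:
    """Raise if the file at `path` does not match the pinned SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_hex:
        raise ValueError(f"Checksum mismatch for {path}")

# verify_sha256("pretrained_models_v3.zip", "<pinned-digest>")  # placeholder digest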
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🔴
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status:
Misleading variable name: The new code assigns the result of urllib.request.urlopen(...) to a variable named
requests, which is misleading and can be confused with the requests library or a
collection of requests.

Referred Code
requests = urllib.request.urlopen(url, context=context, timeout=10)  # nosec B310
with open(model_zip_file_path, "wb") as f:
    f.write(requests.read())

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Overwrite logic bug: The new _download_models implementation contains an unreachable/incorrect overwrite check
(if not os.path.exists(...) then checking existence again) which prevents the overwrite
flag from behaving as intended and can cause incorrect behavior on existing files.

Referred Code
if not os.path.exists(model_zip_file_path):
    if not overwrite and os.path.exists(model_zip_file_path):
        raise FileExistsError(f"Model file already exists: {model_zip_file_path}")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated log level: The new --log_level CLI input is passed directly into configure_logging (after .upper())
without validating against an allowlist of known levels, which can lead to unexpected
logging configuration behavior.

Referred Code
@click.option("--log_level", help="Logging level (default: `info`)", default="info")
@click.pass_context
def transfer_learning(
        ctx,
        idxml: str,
        mzml,
        save_model_dir: str,
        processes,
        ms2_model_dir,
        ms2_tolerance,
        ms2_tolerance_unit,
        calibration_set_size,
        spectrum_id_pattern: str,
        consider_modloss,
        transfer_learning_test_ratio,
        epoch_to_train_ms2,
        force_transfer_learning,
        log_level
):
    """
    Annotate PSMs in an idXML file with additional features using specified models.


 ... (clipped 59 lines)

Learn more about managing compliance generic rules or creating your own custom rules
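One hedged way to address this finding is to constrain the option with click.Choice so that only known level names ever reach configure_logging. A minimal standalone sketch (not the code in this PR; the command name is hypothetical):

import click

@click.command()
@click.option(
    "--log_level",
    type=click.Choice(["debug", "info", "warning", "error", "critical"], case_sensitive=False),
    default="info",
    help="Logging level (default: `info`)",
)
def demo(log_level: str) -> None:
    # configure_logging would now receive a value guaranteed to be a known level
    click.echo(log_level.upper())

if __name__ == "__main__":
    demo()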

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing user context: The new model download action is logged but the logs do not include an actor/user
identifier (and may not be sufficient to reconstruct who initiated the download) which may
be required for audit trails depending on deployment context.

Referred Code
def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
    """Download models if not done yet."""
    url = self.model_url
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")

    if not os.path.exists(model_zip_file_path):
        if not overwrite and os.path.exists(model_zip_file_path):
            raise FileExistsError(f"Model file already exists: {model_zip_file_path}")

        logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
        try:
            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
            context = ssl.create_default_context(cafile=certifi.where())
            requests = urllib.request.urlopen(url, context=context, timeout=10)  # nosec B310
            with open(model_zip_file_path, "wb") as f:
                f.write(requests.read())
        except Exception as e:
            raise FileNotFoundError(
                f"Downloading model failed: {e}.\n" + MODEL_DOWNLOAD_INSTRUCTIONS


 ... (clipped 11 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Detailed error surfaced: The new download failure path raises an exception message that includes the raw underlying
exception text, which may expose internal details to end-users depending on how this
exception is surfaced by the CLI/application.

Referred Code
except Exception as e:
    raise FileNotFoundError(
        f"Downloading model failed: {e}.\n" + MODEL_DOWNLOAD_INSTRUCTIONS
    ) from e

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured path logging: The new logging logs the full download URL and local filesystem path as plain text (not
structured), which may leak sensitive environment/path details depending on where logs are
shipped and who can access them.

Referred Code
logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
try:

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend:
  • 🟢 Fully Compliant
  • 🟡 Partially Compliant
  • 🔴 Not Compliant
  • ⚪ Requires Further Human Verification
  • 🏷️ Compliance label

@qodo-code-review
Contributor

qodo-code-review bot commented Jan 2, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category / Suggestion / Impact
Possible issue
Fix download logic and memory usage
Suggestion Impact: The commit reworked the existing-file/overwrite conditional in _download_models, addressing the logical flaw where the overwrite check was unreachable. However, it did not implement the suggested streaming download (e.g., shutil.copyfileobj) and the behavior differs from the suggestion (it raises on existing file when overwrite=False rather than skipping download).

code diff:

@@ -59,10 +59,11 @@
         if parsed.scheme not in ("http", "https"):
             raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
 
-        if not os.path.exists(model_zip_file_path):
-            if not overwrite and os.path.exists(model_zip_file_path):
+        if os.path.exists(model_zip_file_path):
+            if not overwrite:
                 raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-
+            # File exists and overwrite is True, skip download
+        else:
             logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")

Fix the file download logic in _download_models to correctly handle existing
files and avoid high memory usage by streaming the download instead of reading
the entire file into memory.

quantmsrescore/ms2_model_manager.py [55-84]

 def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
     """Download models if not done yet."""
     url = self.model_url
     parsed = urllib.parse.urlparse(url)
     if parsed.scheme not in ("http", "https"):
         raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
 
-    if not os.path.exists(model_zip_file_path):
-        if not overwrite and os.path.exists(model_zip_file_path):
-            raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-
+    if os.path.exists(model_zip_file_path) and not overwrite:
+        logging.info(f"Model file {model_zip_file_path} already exists, skipping download.")
+    else:
         logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
         try:
             os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
             context = ssl.create_default_context(cafile=certifi.where())
-            requests = urllib.request.urlopen(url, context=context, timeout=10)  # nosec B310
-            with open(model_zip_file_path, "wb") as f:
-                f.write(requests.read())
+            with urllib.request.urlopen(url, context=context, timeout=60) as requests, open(model_zip_file_path, "wb") as f:
+                shutil.copyfileobj(requests, f)
         except Exception as e:
             raise FileNotFoundError(
                 f"Downloading model failed: {e}.\n" + MODEL_DOWNLOAD_INSTRUCTIONS
             ) from e
 
         logging.info("Successfully downloaded pretrained models.")
+
     if not is_model_zip(model_zip_file_path):
         raise ValueError(
             f"Local model file is not a valid zip: {model_zip_file_path}.\n"
             f"Please delete this file and try again.\n"
             f"Or: {MODEL_DOWNLOAD_INSTRUCTIONS}"
         )

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a logical flaw that makes the overwrite check unreachable and a performance issue by reading the entire downloaded file into memory. The proposed fix corrects the logic and improves memory efficiency.

Medium
Guard None model_dir to avoid errors
Suggestion Impact: The code was changed to handle a falsy/None model_dir by computing a target_dir and passing that to MS2ModelManager, preventing passing None directly. The implementation differs from the suggestion by using "." as the fallback rather than calling MS2ModelManager() without a model_dir.

code diff:

-        # Download models to default location
-        MS2ModelManager(model_dir=model_dir)
+        # Download models to specified location or default
+        target_dir = str(model_dir) if model_dir else "."
+        MS2ModelManager(model_dir=target_dir)

In download_alphapeptdeep_models, add a check to ensure model_dir is not None
before passing it to MS2ModelManager to prevent a potential TypeError.

quantmsrescore/model_downloader.py [271]

-MS2ModelManager(model_dir=model_dir)
+if model_dir is not None:
+    MS2ModelManager(model_dir=str(model_dir))
+else:
+    MS2ModelManager()

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a potential TypeError if model_dir is None, as os.path.join in MS2ModelManager would fail. The proposed fix prevents this bug by handling the None case correctly.

Medium
High-level
Avoid hardcoding the model download URL

The model download URL is hardcoded in MS2ModelManager. To improve
maintainability, it should be imported from the peptdeep library or retrieved
dynamically instead.

Examples:

quantmsrescore/ms2_model_manager.py [39]
        self.model_url = "https://github.com/MannLabs/alphapeptdeep/releases/download/pre-trained-models/pretrained_models_v3.zip"

Solution Walkthrough:

Before:

class MS2ModelManager(ModelManager):
    def __init__(self,
                 model_dir: str = ".",
                 ):
        ...
        self.model_url = "https://github.com/MannLabs/alphapeptdeep/releases/download/pre-trained-models/pretrained_models_v3.zip"
        ...
        self._download_models(self.download_model_path)
        ...

    def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
        url = self.model_url
        ...
        urllib.request.urlopen(url, ...)
        ...

After:

from peptdeep.pretrained_models import MODEL_URL # Assuming this constant exists

class MS2ModelManager(ModelManager):
    def __init__(self,
                 model_dir: str = ".",
                 ):
        ...
        self.model_url = MODEL_URL
        ...
        self._download_models(self.download_model_path)
        ...

    def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
        url = self.model_url
        ...
        urllib.request.urlopen(url, ...)
        ...
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a hardcoded URL, which is a maintainability risk if the upstream library changes its model location, making this a significant design improvement.

Medium
General
Respect model_type when loading
Suggestion Impact: The patch modifies load_installed_models to load model files using f"{model_type}/..." paths (rt/ccs/charge and an added ms2 load), reflecting the intent to respect model_type instead of hardcoding "generic". However, it does not cleanly match the suggestion: it appears to remove the model_type parameter from the function signature while still referencing model_type, and it also leaves an initial hardcoded "generic/ms2.pth" load plus a second ms2 load.

code diff:

-    def load_installed_models(self, download_model_path: str = "pretrained_models_v3.zip", model_type: str = "generic"):
+    def load_installed_models(self, download_model_path: str = "pretrained_models_v3.zip"):
         """Load built-in MS2/CCS/RT models.
 
         Parameters
         ----------
-        model_type : str, optional
-            To load the installed MS2/RT/CCS models or phos MS2/RT/CCS models.
-            It could be 'digly', 'phospho', 'HLA', or 'generic'.
-            Defaults to 'generic'.
         download_model_path : str, optional
-            The path of model
+            The path of model zip file.
             Defaults to 'pretrained_models_v3.zip'.
         """
 
         self.ms2_model.load(
             download_model_path, model_path_in_zip="generic/ms2.pth"
         )
-        self.rt_model.load(download_model_path, model_path_in_zip="generic/rt.pth")
+        self.ms2_model.load(
+            download_model_path, model_path_in_zip=f"{model_type}/ms2.pth"
+        )
+        self.rt_model.load(
+            download_model_path, model_path_in_zip=f"{model_type}/rt.pth"
+        )
         self.ccs_model.load(
-            download_model_path, model_path_in_zip="generic/ccs.pth"
+            download_model_path, model_path_in_zip=f"{model_type}/ccs.pth"
         )
         self.charge_model.load(
-            download_model_path, model_path_in_zip="generic/charge.pth"
-        )
-
-    def train_ms2_model(
-            self,
+            download_model_path, model_path_in_zip=f"{model_type}/charge.pth"
+        )

Update load_installed_models to use the model_type parameter when constructing
model paths, instead of hardcoding "generic", enabling the loading of different
model types.

quantmsrescore/ms2_model_manager.py [121-130]

 self.ms2_model.load(
-    download_model_path, model_path_in_zip="generic/ms2.pth"
+    download_model_path, model_path_in_zip=f"{model_type}/ms2.pth"
 )
-self.rt_model.load(download_model_path, model_path_in_zip="generic/rt.pth")
+self.rt_model.load(
+    download_model_path, model_path_in_zip=f"{model_type}/rt.pth"
+)
 self.ccs_model.load(
-    download_model_path, model_path_in_zip="generic/ccs.pth"
+    download_model_path, model_path_in_zip=f"{model_type}/ccs.pth"
 )
 self.charge_model.load(
-    download_model_path, model_path_in_zip="generic/charge.pth"
+    download_model_path, model_path_in_zip=f"{model_type}/charge.pth"
 )

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that the load_installed_models method ignores its model_type parameter, hardcoding "generic". The fix makes the method functional as intended, allowing different model types to be loaded.

Medium
Use a dedicated cache directory

Modify MS2ModelManager to download models to a dedicated user cache directory
instead of the current working directory to avoid clutter. This can be achieved
using a library like appdirs.

quantmsrescore/ms2_model_manager.py [22-50]

+import appdirs
+
 class MS2ModelManager(ModelManager):
     def __init__(self,
                  mask_modloss: bool = False,
                  device: str = "gpu",
-                 model_dir: str = ".",
+                 model_dir: str = None,
                  ):
         self._train_psm_logging = True
 ...
         self.charge_model: ChargeModelForModAASeq = ChargeModelForModAASeq(
             device=device
         )
         self.model_url = "https://github.com/MannLabs/alphapeptdeep/releases/download/pre-trained-models/pretrained_models_v3.zip"
+
+        if model_dir is None:
+            model_dir = appdirs.user_cache_dir("quantmsrescore")
 
         if len(glob.glob(os.path.join(model_dir, "*ms2.pth"))) > 0:
             self.load_external_models(ms2_model_file=glob.glob(os.path.join(model_dir, "*ms2.pth"))[0])
             self.model_str = model_dir
         else:
             self.download_model_path = os.path.join(model_dir, "pretrained_models_v3.zip")
             self._download_models(self.download_model_path)
             self.load_installed_models(self.download_model_path)
             self.model_str = "generic"
         self.pretrained_ms2_model = copy.deepcopy(self.ms2_model)
         self.reset_by_global_settings(reload_models=False)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6


Why: The suggestion provides a valid user experience improvement by proposing to use a dedicated cache directory instead of the current working directory for model downloads. While this is a good design change, it introduces a new dependency.

Low

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
quantmsrescore/transfer_learning.py (1)

206-209: Move import to module level.

The configure_logging import is placed inside the __init__ method, but it should be at the top of the file alongside other imports (line 5 already imports from the same module).

🔎 Proposed refactor

Move the import to the top of the file (around line 5):

 from quantmsrescore.idxmlreader import IdXMLRescoringReader
-from quantmsrescore.logging_config import get_logger
+from quantmsrescore.logging_config import get_logger, configure_logging
 from quantmsrescore.openms import OpenMSHelper

Then remove the import from the __init__ method:

         self._save_model_dir = save_model_dir
 
-        # Set up logging
-        from quantmsrescore.logging_config import configure_logging
-
         configure_logging(log_level)
quantmsrescore/ms2_model_manager.py (1)

284-291: Consider refactoring default argument computation.

The default value for charged_frag_types calls get_charged_frag_types(frag_types, max_frag_charge) at function definition time, which means the computation happens once when the module is loaded, not per instance. While this may be intentional for performance, it can be confusing and is flagged by static analysis (Ruff B008).

🔎 Proposed refactor
     def __init__(
             self,
-            charged_frag_types=get_charged_frag_types(frag_types, max_frag_charge),
+            charged_frag_types=None,
             dropout=0.1,
             model_class: torch.nn.Module = ModelMS2Bert,
             device: str = "gpu",
             mask_modloss: Optional[bool] = None,
             override_from_weights: bool = False,
             **kwargs,  # model params
     ):
+        if charged_frag_types is None:
+            charged_frag_types = get_charged_frag_types(frag_types, max_frag_charge)
         super().__init__(
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2732a4 and 735be0d.

📒 Files selected for processing (5)
  • Dockerfile
  • quantmsrescore/constants.py
  • quantmsrescore/model_downloader.py
  • quantmsrescore/ms2_model_manager.py
  • quantmsrescore/transfer_learning.py
🧰 Additional context used
🧬 Code graph analysis (2)
quantmsrescore/transfer_learning.py (1)
quantmsrescore/logging_config.py (1)
  • configure_logging (54-165)
quantmsrescore/model_downloader.py (1)
quantmsrescore/ms2_model_manager.py (1)
  • MS2ModelManager (22-256)
🪛 Ruff (0.14.10)
quantmsrescore/ms2_model_manager.py

60-60: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


80-84: Avoid specifying long messages outside the exception class

(TRY003)


107-107: Unused method argument: model_type

(ARG002)


285-285: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


303-303: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


312-312: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
🔇 Additional comments (8)
Dockerfile (1)

24-24: LGTM! Build and runtime stages now aligned.

The change aligns the build stage working directory with the runtime stage, improving consistency across the multi-stage build.

quantmsrescore/transfer_learning.py (3)

94-94: LGTM! Log level configuration added.

The CLI option provides users with control over logging verbosity, with a sensible default.


110-112: LGTM! Parameter properly integrated.

The log_level parameter is correctly added to the function signature with appropriate documentation.

Also applies to: 156-157


171-172: LGTM! Log level properly propagated.

The log level is correctly normalized to uppercase and passed to the trainer initialization.

quantmsrescore/constants.py (1)

70-70: LGTM! Feature mapping extended.

The new "MS2PIP:Cos" mapping follows the existing naming conventions and is properly positioned among related cosine features.

quantmsrescore/ms2_model_manager.py (3)

2-3: LGTM! Imports added for download functionality.

The new imports support secure model downloading with SSL verification using certifi.

Also applies to: 17-19


26-26: Verify default model directory behavior.

The constructor now uses model_dir="." as the default, which resolves to the current working directory. Ensure this works correctly when the code is invoked from different directories or contexts (e.g., Docker containers, different execution environments).

Consider whether a more stable default (e.g., a home directory path or config-based location) would be more robust across different execution contexts.

Also applies to: 39-48


58-60: LGTM! Proper security measures for downloads.

The URL scheme validation and SSL context with certifi ensure secure model downloads. The nosec comment appropriately documents the acknowledged security consideration.

Also applies to: 69-70

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
quantmsrescore/ms2_model_manager.py (3)

62-66: Clarify or fix the overwrite parameter semantics.

The current logic is backwards from typical overwrite semantics:

  • When overwrite=True and file exists: skips download (line 65 comment)
  • When overwrite=False and file exists: raises error

Typically, overwrite=True means "replace the existing file," not "skip if it exists." This naming could confuse future maintainers.

Consider either:

  1. Renaming the parameter to allow_existing or skip_if_exists to match the actual behavior, or
  2. Fixing the logic so overwrite=True actually re-downloads and replaces the file
🔎 Option 1: Rename parameter to match behavior
-    def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
+    def _download_models(self, model_zip_file_path: str, allow_existing: bool = True) -> None:
         """Download models if not done yet."""
         url = self.model_url
         parsed = urllib.parse.urlparse(url)
         if parsed.scheme not in ("http", "https"):
             raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
 
         if os.path.exists(model_zip_file_path):
-            if not overwrite:
+            if not allow_existing:
                 raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-            # File exists and overwrite is True, skip download
+            # File exists and allow_existing is True, skip download
         else:
🔎 Option 2: Fix logic to match typical overwrite semantics
     def _download_models(self, model_zip_file_path: str, overwrite: bool = True) -> None:
         """Download models if not done yet."""
         url = self.model_url
         parsed = urllib.parse.urlparse(url)
         if parsed.scheme not in ("http", "https"):
             raise ValueError(f"Disallowed URL scheme: {parsed.scheme}")
 
-        if os.path.exists(model_zip_file_path):
-            if not overwrite:
-                raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-            # File exists and overwrite is True, skip download
-        else:
+        if os.path.exists(model_zip_file_path) and not overwrite:
+            raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
+        
+        if not os.path.exists(model_zip_file_path) or overwrite:
             logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")

282-282: Consider moving the function call out of the default argument.

The default argument calls get_charged_frag_types(frag_types, max_frag_charge) at function definition time. This can cause issues if:

  1. The function returns a mutable object shared across all instances
  2. The function depends on module-level state that changes

Move the call inside the function body instead.

🔎 Proposed fix
     def __init__(
             self,
-            charged_frag_types=get_charged_frag_types(frag_types, max_frag_charge),
+            charged_frag_types=None,
             dropout=0.1,
             model_class: torch.nn.Module = ModelMS2Bert,
             device: str = "gpu",
             mask_modloss: Optional[bool] = None,
             override_from_weights: bool = False,
             **kwargs,  # model params
     ):
+        if charged_frag_types is None:
+            charged_frag_types = get_charged_frag_types(frag_types, max_frag_charge)
         super().__init__(
             charged_frag_types=charged_frag_types,

299-303: Add stacklevel=2 to the deprecation warning.

The warnings.warn call should include stacklevel=2 so the warning points to the caller's code rather than this line. This helps users identify where they're using the deprecated parameter.

🔎 Proposed fix
         if mask_modloss is not None:
             warnings.warn(
                 "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
                 "the charged_frag_types should not include the modloss fragments.",
+                stacklevel=2
             )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eaf40cd and 4e47634.

📒 Files selected for processing (1)
  • quantmsrescore/ms2_model_manager.py
🧰 Additional context used
🪛 Ruff (0.14.10)
quantmsrescore/ms2_model_manager.py

60-60: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


81-85: Avoid specifying long messages outside the exception class

(TRY003)


282-282: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


300-300: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


309-309: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
🔇 Additional comments (3)
quantmsrescore/ms2_model_manager.py (3)

2-3: LGTM! Imports support custom download implementation.

The new imports (MODEL_DOWNLOAD_INSTRUCTIONS, is_model_zip, urllib, ssl, certifi) appropriately support the custom model download implementation with SSL/certificate handling.

Also applies to: 17-19


26-26: LGTM! Constructor changes streamline model initialization.

The changes consolidate model discovery, download, and loading into a unified flow:

  1. Check for external models in model_dir
  2. If not found, download and load the pretrained models
  3. Store the model source for reference

The default model_dir="." and the new model_url attribute are appropriate.

Also applies to: 39-39, 41-48


108-128: LGTM! Method now has clear, focused responsibility.

The model_type parameter flagged in the previous review has been removed, and the method now cleanly loads the generic models from the specified zip file. This is a good simplification.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
quantmsrescore/model_downloader.py (1)

270-272: Consider optimizing download-only flow.

MS2ModelManager.__init__ downloads models and also initializes all model instances (MS2, RT, CCS, charge) in memory, which are immediately discarded since the manager object isn't stored. For a download-only function, this initialization overhead is unnecessary.

If MS2ModelManager exposed a static download method or a lighter initialization path that only downloads without loading models, it would be more efficient for this use case.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e47634 and 399d378.

📒 Files selected for processing (1)
  • quantmsrescore/model_downloader.py
🧰 Additional context used
🧬 Code graph analysis (1)
quantmsrescore/model_downloader.py (1)
quantmsrescore/ms2_model_manager.py (1)
  • MS2ModelManager (22-253)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
🔇 Additional comments (2)
quantmsrescore/model_downloader.py (2)

16-16: LGTM! Import added for refactored download flow.

The MS2ModelManager import is correctly added to support the delegated download logic.


270-272: Previous issue resolved correctly.

The None handling has been properly fixed. Line 271 correctly converts model_dir (Optional[Path]) to a string, defaulting to "." when None, before passing to MS2ModelManager. This addresses the previous TypeError concern.

ypriverol and others added 3 commits January 2, 2026 19:27
Co-authored-by: qodo-code-review[bot] <151058649+qodo-code-review[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
quantmsrescore/ms2_model_manager.py (3)

21-52: Duplicate threading configuration logic.

_configure_torch_for_hpc duplicates the functionality of configure_torch_threads in quantmsrescore/__init__.py. Both set torch.set_num_threads and torch.set_num_interop_threads. Additionally, calling this at module import time (line 52) happens after __init__.py has already configured threading.

🔎 Proposed consolidation

Consider removing this local function and relying on the centralized threading configuration:

-def _configure_torch_for_hpc(n_threads: int = 1) -> None:
-    """
-    Configure PyTorch thread settings for HPC environments.
-    ...
-    """
-    try:
-        # Limit intra-op parallelism (within single operations)
-        torch.set_num_threads(n_threads)
-        # Limit inter-op parallelism (between independent operations)
-        torch.set_num_interop_threads(n_threads)
-    except RuntimeError:
-        # Threads already configured (can only be set once per process)
-        pass
-
-
-# Apply PyTorch thread limits immediately
-_configure_torch_for_hpc(n_threads=1)
+# Threading is configured by quantmsrescore.__init__.configure_threading()

If you need explicit control here, import and call configure_torch_threads from quantmsrescore.


100-106: Short timeout for model downloads.

The 10-second timeout on line 104 may be insufficient for downloading large model files, especially on slower network connections. This could cause unexpected failures.

🔎 Proposed fix
-                requests = urllib.request.urlopen(url, context=context, timeout=10)  # nosec B310
+                requests = urllib.request.urlopen(url, context=context, timeout=300)  # nosec B310

Alternatively, consider using a streaming download approach with progress indication for better UX with large files.


332-336: Add stacklevel to warnings.warn.

The warning lacks a stacklevel argument, which means the warning will point to this line rather than the caller's location.

🔎 Proposed fix
         if mask_modloss is not None:
             warnings.warn(
                 "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
-                "the charged_frag_types should not include the modloss fragments."
+                "the charged_frag_types should not include the modloss fragments.",
+                stacklevel=2
             )
quantmsrescore/__init__.py (1)

82-139: Consider threadpoolctl for BLAS/OpenMP control, though it's complementary rather than a replacement.

threadpoolctl provides a unified interface for BLAS libraries (MKL, OpenBLAS, BLIS) and OpenMP runtimes, with support for temporary/context-manager-based control. However, it doesn't cover TensorFlow-specific settings (TF_NUM_INTEROP_THREADS, TF_NUM_INTRAOP_THREADS) or NumExpr, so the current environment variable approach is necessary for comprehensive library coverage in this context.
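For illustration, a minimal sketch of the pre-import env-var approach discussed here, combined with the PyTorch limits the module applies; the exact variable names and defaults chosen by quantmsrescore/__init__.py are assumptions:

import os

def limit_threads(n: int = 1) -> None:
    # Standard BLAS/OpenMP/TensorFlow/NumExpr knobs; they only take effect if set
    # before numpy/torch/tensorflow are imported.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                "NUMEXPR_NUM_THREADS", "TF_NUM_INTRAOP_THREADS", "TF_NUM_INTEROP_THREADS"):
        os.environ.setdefault(var, str(n))

limit_threads(1)

import torch  # imported only after the limits are in place

try:
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)
except RuntimeError:
    pass  # thread counts can only be configured once per process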

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 399d378 and b6da28e.

📒 Files selected for processing (8)
  • environment.yml
  • pyproject.toml
  • quantmsrescore/__init__.py
  • quantmsrescore/idxmlreader.py
  • quantmsrescore/ms2_model_manager.py
  • quantmsrescore/ms2rescore.py
  • quantmsrescore/transfer_learning.py
  • requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • quantmsrescore/transfer_learning.py
🧰 Additional context used
🧬 Code graph analysis (2)
quantmsrescore/idxmlreader.py (1)
quantmsrescore/openms.py (1)
  • is_decoy_peptide_hit (131-149)
quantmsrescore/ms2rescore.py (1)
quantmsrescore/__init__.py (2)
  • configure_threading (82-139)
  • configure_torch_threads (142-165)
🪛 Ruff (0.14.10)
quantmsrescore/__init__.py

10-10: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)


219-225: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

quantmsrescore/ms2_model_manager.py

93-93: Avoid specifying long messages outside the exception class

(TRY003)


97-97: Avoid specifying long messages outside the exception class

(TRY003)


104-104: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


114-118: Avoid specifying long messages outside the exception class

(TRY003)


315-315: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


333-333: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


342-342: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
pyproject.toml (1)

45-45: LGTM!

The psutil dependency addition is appropriate for the memory-aware process counting functionality. The unpinned version is consistent with other dependencies in this file.

requirements.txt (1)

12-13: LGTM!

The psutil dependency is added consistently with pyproject.toml and environment.yml.

quantmsrescore/idxmlreader.py (1)

225-235: Good performance optimization.

The switch from DataFrame.append() in a loop to collecting records in a list and constructing the DataFrame in a single operation is a significant improvement (O(n) vs O(n²)). The docstring clearly documents the rationale.
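The pattern in question, sketched with placeholder data (the real loop iterates peptide hits parsed from the idXML file):

import pandas as pd

# Placeholder rows standing in for parsed peptide hits
parsed_hits = [("PEPTIDEK", 2), ("ACDEFGHIK", 3)]

# Collect plain dict records first, then build the DataFrame once (O(n)),
# instead of appending to a DataFrame inside the loop (O(n^2)).
psm_records = [{"sequence": seq, "charge": z} for seq, z in parsed_hits]
psms_df = pd.DataFrame(psm_records) if psm_records else pd.DataFrame()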

quantmsrescore/ms2_model_manager.py (1)

141-160: Past issue addressed.

The unused model_type parameter flagged in previous reviews has been removed, simplifying the API. This is a valid resolution if the functionality isn't needed.

environment.yml (1)

21-21: LGTM!

The psutil dependency is added consistently across all dependency manifests.

quantmsrescore/ms2rescore.py (2)

7-8: LGTM!

Correct import ordering - threading configuration is imported before heavy library imports (FeatureAnnotator).


225-229: Good HPC-aware threading setup.

The threading configuration is correctly applied before creating the FeatureAnnotator, ensuring that all downstream library imports respect the thread limits. The explanatory comment is helpful.

quantmsrescore/__init__.py (2)

1-22: Well-documented threading control module.

The module-level documentation clearly explains the HPC thread explosion problem and the solution. The approach of setting environment variables before heavy library imports is the correct pattern.


168-198: Good fallback handling for psutil.

The get_safe_process_count function gracefully handles the case where psutil is not available by falling back to CPU count. This maintains compatibility while providing better resource management when possible.
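A hedged sketch of that fallback pattern; the real get_safe_process_count may use a different memory heuristic, and the 4 GiB per-process budget below is an assumption:

import os
from typing import Optional

def safe_process_count(requested: Optional[int] = None, gib_per_process: float = 4.0) -> int:
    cpu_count = os.cpu_count() or 1
    try:
        import psutil
        available_gib = psutil.virtual_memory().available / (1024 ** 3)
        memory_cap = max(1, int(available_gib // gib_per_process))
    except ImportError:
        memory_cap = cpu_count  # psutil unavailable: fall back to the CPU count alone
    limit = min(cpu_count, memory_cap)
    return min(requested, limit) if requested else limit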

ypriverol and others added 2 commits January 2, 2026 22:19
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
quantmsrescore/openms.py (1)

674-740: Clarify memory efficiency claims when use_cache=False.

The docstring for iter_mslevel_spectra states it's "more memory-efficient than get_mslevel_spectra()", but this is only true when use_cache=True. When use_cache=False, both methods call get_spectrum_lookup_indexer, which loads the entire MSExperiment into memory.

Consider updating the docstring to clarify:

  • "This is more memory-efficient than get_mslevel_spectra() when you're iterating once through cached data..."
  • Or note that even with use_cache=False, the underlying MSExperiment is fully loaded

This sets correct expectations for users trying to minimize memory usage.

📝 Suggested docstring improvement
     def iter_mslevel_spectra(
         file_name: Union[str, Path],
         ms_level: int,
         use_cache: bool = True
     ) -> Generator[oms.MSSpectrum, None, None]:
         """
         Iterate over spectra of a specific MS level (memory-efficient generator).
 
-        This is more memory-efficient than get_mslevel_spectra() when you don't
-        need all spectra at once.
+        This is more memory-efficient than get_mslevel_spectra() when you need
+        to iterate through spectra without storing them all in memory at once.
+        Note: The underlying MSExperiment is still fully loaded; this avoids
+        creating an intermediate list.
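For context, a minimal usage sketch of the generator under discussion, assuming iter_mslevel_spectra can be called as a static method with the signature shown above and that clear_spectrum_cache is the module-level helper listed in the code graph ("sample.mzML" is a placeholder path):

from quantmsrescore.openms import OpenMSHelper, clear_spectrum_cache

# Stream MS2 spectra without materializing the full list; the bounded cache
# lets a second feature generator reuse the parsed data.
n_ms2 = 0
for spectrum in OpenMSHelper.iter_mslevel_spectra("sample.mzML", ms_level=2, use_cache=True):
    n_ms2 += 1

clear_spectrum_cache()  # explicitly release cached spectrum data when finished
print(f"MS2 spectra: {n_ms2}")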
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6da28e and 76c51ca.

📒 Files selected for processing (4)
  • quantmsrescore/alphapeptdeep.py
  • quantmsrescore/annotator.py
  • quantmsrescore/ms2pip.py
  • quantmsrescore/openms.py
🧰 Additional context used
🧬 Code graph analysis (3)
quantmsrescore/annotator.py (3)
quantmsrescore/idxmlreader.py (2)
  • psms (90-92)
  • psms (95-99)
quantmsrescore/ms2pip.py (1)
  • add_features (298-340)
quantmsrescore/openms.py (2)
  • OpenMSHelper (103-751)
  • clear_spectrum_cache (83-100)
quantmsrescore/alphapeptdeep.py (1)
quantmsrescore/ms2pip.py (1)
  • read_spectrum_file (556-615)
quantmsrescore/ms2pip.py (2)
quantmsrescore/alphapeptdeep.py (1)
  • read_spectrum_file (972-1031)
quantmsrescore/openms.py (2)
  • OpenMSHelper (103-751)
  • iter_mslevel_spectra (708-740)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
🔇 Additional comments (4)
quantmsrescore/alphapeptdeep.py (1)

972-1031: LGTM! Memory-efficient iterator with caching support.

The change from get_mslevel_spectra to iter_mslevel_spectra improves memory efficiency by using a generator instead of loading all spectra into memory at once. The addition of the use_cache parameter (defaulting to True) enables sharing spectrum data across MS2PIP and AlphaPeptDeep feature generators, reducing redundant file I/O.

The docstring clearly documents the new parameter and its purpose.

quantmsrescore/ms2pip.py (1)

556-615: LGTM! Consistent caching pattern across feature generators.

This change mirrors the identical improvement in quantmsrescore/alphapeptdeep.py, ensuring both MS2PIP and AlphaPeptDeep benefit from the same caching and memory-efficiency improvements. The consistency across feature generators is good for maintainability.

quantmsrescore/annotator.py (2)

282-284: Good practice: explicit memory cleanup after annotation.

Calling clear_spectrum_cache() and gc.collect() after annotation completes is good for releasing resources promptly. This aligns with the broader memory optimization effort in this PR and ensures that cached spectrum data doesn't persist unnecessarily.


702-726: Memory-efficient model comparison, contingent on shallow copy safety.

The replacement of copy.deepcopy() with _shallow_copy_psm_list() is excellent for memory efficiency, and the explicit cleanup (del + gc.collect()) after each model evaluation ensures memory is promptly released.

However, the safety of this change depends on the shallow copy implementation being correct (see comment on lines 20-63). If shared mutable attributes are modified during add_features(), this could cause issues.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
quantmsrescore/idxmlreader.py (1)

307-308: Consider specifying columns for empty DataFrame consistency.

When psm_records is empty, the resulting DataFrame has no columns. While this is likely a rare edge case, specifying the expected column names ensures schema consistency.

🔎 Optional improvement to maintain schema
-        self._psms_df = pd.DataFrame(psm_records) if psm_records else pd.DataFrame()
+        self._psms_df = pd.DataFrame(psm_records) if psm_records else pd.DataFrame(
+            columns=["sequence", "charge", "mods", "mod_sites", "nce", 
+                     "provenance_data", "instrument", "spectrum_ref", "filename",
+                     "is_decoy", "rank", "score"]
+        )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76c51ca and 11bb05b.

📒 Files selected for processing (1)
  • quantmsrescore/idxmlreader.py
🧰 Additional context used
🧬 Code graph analysis (1)
quantmsrescore/idxmlreader.py (1)
quantmsrescore/openms.py (1)
  • is_decoy_peptide_hit (193-211)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: build (3.11)
🔇 Additional comments (3)
quantmsrescore/idxmlreader.py (3)

225-231: Excellent documentation of the performance rationale.

The added notes clearly explain why the list-based approach is preferred over DataFrame.append() in a loop, making the design decision transparent to future maintainers.


233-235: Well-structured initialization with clear performance comments.

The list-based approach is the correct pattern for building DataFrames efficiently.


290-304: Excellent performance improvement and bug fix.

The refactor from DataFrame append to list append delivers the promised O(n) performance, and line 300 now correctly handles the None filename case (addressing the previous TypeError concern).
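As an illustration of the list-then-DataFrame pattern (the column names and toy PSM tuples below are placeholders, not the real idxmlreader schema):

```python
import pandas as pd

# Build row dicts in a plain list, then construct the DataFrame once.
# Appending to a DataFrame inside the loop copies all existing rows each time,
# which is O(n^2) overall; this pattern stays O(n).
psm_records = []
for sequence, charge, score in [("PEPTIDER", 2, 0.01), ("SEQUENCEK", 3, 0.02)]:
    psm_records.append({"sequence": sequence, "charge": charge, "score": score})

# Specifying columns keeps the schema stable even when no PSMs were parsed.
psms_df = pd.DataFrame(psm_records) if psm_records else pd.DataFrame(
    columns=["sequence", "charge", "score"]
)
```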

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
quantmsrescore/annotator.py (1)

20-63: Fix default values to preserve None semantics.

Lines 55, 58, and 59 incorrectly use empty containers as fallback values instead of preserving None:

  • Line 55: protein_list=psm.protein_list.copy() if psm.protein_list else [] → should be else None
  • Line 58: provenance_data=psm.provenance_data.copy() if psm.provenance_data else {} → should be else None
  • Line 59: metadata=psm.metadata.copy() if psm.metadata else {} → should be else None

Using empty containers instead of None changes the semantics and may break code that checks for None explicitly.

🔎 Proposed fix
-            protein_list=psm.protein_list.copy() if psm.protein_list else [],
+            protein_list=psm.protein_list.copy() if psm.protein_list else None,
             rank=psm.rank,
             source=psm.source,
-            provenance_data=psm.provenance_data.copy() if psm.provenance_data else {},  # Can share as keys are read-only
-            metadata=psm.metadata.copy() if psm.metadata else {},
+            provenance_data=psm.provenance_data.copy() if psm.provenance_data else None,
+            metadata=psm.metadata.copy() if psm.metadata else None,
             rescoring_features={},  # Fresh dict - this is what will be modified
🧹 Nitpick comments (4)
quantmsrescore/openms.py (1)

176-211: Optimize the enumeration path in organize_psms_by_spectrum_id.

Line 206 uses enumerated_psm_list.index(item) which is O(n) and inefficient when repeatedly called for non-tuple items. If the input is a raw PSM list, consider enumerating it once upfront:

def organize_psms_by_spectrum_id(
    enumerated_psm_list: List[Any]
) -> Dict[str, List[Tuple[int, Any]]]:
    from collections import defaultdict
    psms_by_specid = defaultdict(list)
    
    # Check first item to determine format
    if enumerated_psm_list and not (isinstance(enumerated_psm_list[0], tuple) and len(enumerated_psm_list[0]) == 2):
        # Enumerate once if not already enumerated
        enumerated_psm_list = list(enumerate(enumerated_psm_list))
    
    for psm_index, psm in enumerated_psm_list:
        psms_by_specid[str(psm.spectrum_id)].append((psm_index, psm))
    
    return psms_by_specid

This avoids O(n²) behavior when processing raw PSM lists.

quantmsrescore/ms2_model_manager.py (3)

108-108: Handle edge case where model path has no directory component.

If model_zip_file_path is just a filename (e.g., "model.zip"), os.path.dirname() returns an empty string, and os.makedirs("") may behave unexpectedly on some systems.

🔎 Proposed defensive fix
-            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
+            model_dir = os.path.dirname(model_zip_file_path)
+            if model_dir:
+                os.makedirs(model_dir, exist_ok=True)

356-356: Verify whether **kwargs is needed for interface compatibility.

The **kwargs parameter is unused in the method body. If this method overrides a parent class method or implements an interface that requires **kwargs, the parameter is justified. Otherwise, it can be removed.


347-347: Add stacklevel=2 to improve warning clarity.

The warning should point to the caller's location, not the function definition. Adding stacklevel=2 improves the developer experience when diagnosing deprecation warnings.

🔎 Proposed fix
         warnings.warn(
             "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
             "the charged_frag_types should not include the modloss fragments.",
+            stacklevel=2
         )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11bb05b and 6b75a5f.

📒 Files selected for processing (6)
  • quantmsrescore/alphapeptdeep.py
  • quantmsrescore/annotator.py
  • quantmsrescore/ms2_model_manager.py
  • quantmsrescore/ms2pip.py
  • quantmsrescore/openms.py
  • quantmsrescore/transfer_learning.py
🧰 Additional context used
🧬 Code graph analysis (3)
quantmsrescore/annotator.py (1)
quantmsrescore/openms.py (1)
  • clear_spectrum_cache (143-166)
quantmsrescore/alphapeptdeep.py (2)
quantmsrescore/openms.py (5)
  • OpenMSHelper (237-885)
  • get_compiled_regex (53-82)
  • organize_psms_by_spectrum_id (176-211)
  • calculate_correlations (214-234)
  • iter_mslevel_spectra (842-874)
quantmsrescore/ms2pip.py (1)
  • read_spectrum_file (549-608)
quantmsrescore/ms2pip.py (2)
quantmsrescore/openms.py (5)
  • OpenMSHelper (237-885)
  • get_compiled_regex (53-82)
  • organize_psms_by_spectrum_id (176-211)
  • calculate_correlations (214-234)
  • iter_mslevel_spectra (842-874)
quantmsrescore/alphapeptdeep.py (1)
  • read_spectrum_file (960-1019)
🪛 Ruff (0.14.10)
quantmsrescore/ms2_model_manager.py

99-99: Avoid specifying long messages outside the exception class

(TRY003)


103-103: Avoid specifying long messages outside the exception class

(TRY003)


112-112: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


128-132: Avoid specifying long messages outside the exception class

(TRY003)


329-329: Do not perform function call get_charged_frag_types in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


347-347: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


356-356: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
🔇 Additional comments (7)
quantmsrescore/ms2pip.py (1)

30-35: LGTM! Caching integration is well-implemented.

The changes successfully integrate the new caching infrastructure:

  • Import of shared utilities (get_compiled_regex, organize_psms_by_spectrum_id, calculate_correlations) eliminates code duplication
  • use_cache parameter enables memory-efficient spectrum loading with proper documentation
  • Consistent usage of cached utilities throughout the processing pipeline

Also applies to: 549-608, 812-816

quantmsrescore/openms.py (1)

40-82: LGTM! Caching infrastructure is well-designed.

The new caching infrastructure provides:

  • Bounded LRU cache (MAX_CACHE_SIZE=3) preventing unbounded memory growth
  • Efficient regex compilation caching with fallback handling
  • Clear separation of concerns between get/iter spectrum methods
  • Explicit cache clearing for memory management

Also applies to: 85-141, 143-167, 214-235, 808-874
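A compact sketch of the two ideas, a bounded LRU-style spectrum cache and memoized regex compilation, is shown below; the names mirror the review, but the code is illustrative and not the actual openms.py implementation.

```python
import re
from collections import OrderedDict
from functools import lru_cache

MAX_CACHE_SIZE = 3  # keep at most three files' spectra in memory

_spectrum_cache: "OrderedDict[str, list]" = OrderedDict()

def cache_spectra(path: str, spectra: list) -> None:
    """Insert into the bounded cache, evicting the least recently used entry."""
    _spectrum_cache[path] = spectra
    _spectrum_cache.move_to_end(path)
    while len(_spectrum_cache) > MAX_CACHE_SIZE:
        _spectrum_cache.popitem(last=False)

@lru_cache(maxsize=None)
def get_compiled_regex(pattern: str) -> re.Pattern:
    """Compile each pattern once and reuse it across feature generators."""
    return re.compile(pattern)
```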

quantmsrescore/alphapeptdeep.py (1)

10-15: LGTM! Consistent integration with caching infrastructure.

The changes mirror the ms2pip.py implementation:

  • Shared utilities properly imported and used
  • use_cache parameter consistently applied
  • Regex compilation and PSM organization follow the same efficient pattern

Also applies to: 831-835, 902-907, 960-1019

quantmsrescore/annotator.py (1)

282-284: LGTM! Memory management properly implemented.

The shallow copy approach combined with explicit cache clearing is effective:

  • _shallow_copy_psm_list avoids deep copy overhead while creating fresh rescoring_features dicts
  • clear_spectrum_cache() and gc.collect() calls prevent memory leaks
  • Strategic cleanup between model evaluations reduces peak memory usage

Also applies to: 702-726

quantmsrescore/transfer_learning.py (2)

167-170: Verify threading configuration aligns with user expectations.

Lines 167-170 hardcode n_threads=1 regardless of the processes parameter value. While the help text explains "Each process uses 1 internal thread to avoid HPC resource contention," this design may be surprising:

  • User passes --processes=4 expecting 4-way parallelism
  • Code forces internal threading to 1, which is correct for avoiding oversubscription in multiprocessing
  • However, the relationship between processes (external parallelism) and internal threading (forced to 1) should be clearer

Consider:

  1. Clarifying in the help text that processes controls the multiprocessing pool size, not internal threads
  2. Adding a code comment explaining why threading is forced to 1 in HPC contexts
  3. Verifying that peptdeep/torch models respect these thread limits
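To make the intended relationship explicit, here is a hedged sketch of outer multiprocessing with inner threading pinned to one thread per worker; the worker function and batch type are placeholders, and whether peptdeep/torch actually honor these settings still needs the verification requested above.

```python
import os
from multiprocessing import Pool

def _init_worker() -> None:
    # Pin each worker to a single internal thread so that `processes` workers
    # do not oversubscribe the node (processes x BLAS/OpenMP threads).
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    try:
        import torch
        torch.set_num_threads(1)
    except ImportError:
        pass

def _score_batch(batch):
    return len(batch)  # placeholder for per-batch model inference

def run_parallel(batches, processes: int = 4):
    # `processes` controls the pool size (outer parallelism); inner threading stays at 1.
    with Pool(processes=processes, initializer=_init_worker) as pool:
        return pool.map(_score_batch, batches)
```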

5-6: LGTM! Logging and caching integration is correct.

The changes properly integrate logging and caching utilities:

  • Log level flows from CLI → trainer → configure_logging
  • get_compiled_regex replaces inline regex compilation
  • Threading configuration documented in help text

Also applies to: 9-9, 100-100, 184-184, 201-201, 219-221, 330-331
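A minimal sketch of that flow, assuming a configure_logging helper that maps a level string onto logging.basicConfig (the real helper and trainer may differ):

```python
import logging
import click

def configure_logging(log_level: str = "info") -> None:
    # Assumed behavior: translate the string into a standard logging level.
    logging.basicConfig(level=getattr(logging, log_level.upper(), logging.INFO))

class AlphaPeptdeepTrainerSketch:
    def __init__(self, log_level: str = "info") -> None:
        configure_logging(log_level)  # configure once at trainer construction

@click.command()
@click.option("--log_level", default="info", help="Logging level (debug/info/warning/error).")
def transfer_learning(log_level: str) -> None:
    AlphaPeptdeepTrainerSketch(log_level=log_level)
```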

quantmsrescore/ms2_model_manager.py (1)

155-175: Verify whether model_type support was intentionally removed.

Previous review comments suggested adding a model_type parameter to support loading different model variants (e.g., 'phospho', 'HLA', 'digly'), but the current implementation hardcodes "generic/" paths for all models. The past suggestion was marked as "manually_applied" (lines 168-177), yet the parameter is absent.

Was the model_type feature intentionally removed to simplify the API, or should it still be implemented per the previous review?

Comment on lines 101 to 105
if os.path.exists(model_zip_file_path):
if not overwrite:
raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
# File exists and overwrite is True, skip download
else:
Contributor

⚠️ Potential issue | 🟠 Major

Clarify or fix the overwrite parameter behavior.

The current logic skips the download when the file exists and overwrite=True, rather than actually overwriting the file. This is misleading:

  • Line 102-103: If file exists and overwrite=False, raise an error ✓
  • Line 104: Comment says "File exists and overwrite is True, skip download"
  • Expected: If overwrite=True, the file should be re-downloaded/overwritten

Consider either:

  1. Rename the parameter to skip_if_exists or allow_existing for clarity, or
  2. Implement actual overwrite behavior by removing the existing file before download
🔎 Proposed fix to implement actual overwrite behavior
     if os.path.exists(model_zip_file_path):
         if not overwrite:
             raise FileExistsError(f"Model file already exists: {model_zip_file_path}")
-        # File exists and overwrite is True, skip download
+        # File exists and overwrite is True, remove it to re-download
+        os.remove(model_zip_file_path)
+
+    logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
+    try:
+        os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
-    else:
-        logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
-        try:
-            os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.10)

103-103: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In quantmsrescore/ms2_model_manager.py around lines 101 to 105, the current
logic treats overwrite=True as "skip download" which is misleading; change
behavior so that if os.path.exists(model_zip_file_path) and overwrite is True,
delete the existing file (os.remove or Path.unlink) before proceeding to
download and writing the new file, and if overwrite is False keep the existing
raise FileExistsError path; ensure deletion errors are handled or propagated
with a clear error message and that any parent directories and permissions are
respected before attempting the download.

@jpfeuffer

OK. Can you check the last three Major CodeRabbit comments? They make sense.

@ypriverol
Member Author

I'm on it.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
quantmsrescore/ms2_model_manager.py (1)

362-366: Add stacklevel=2 to warnings.warn for accurate caller location.

Without an explicit stacklevel, the warning will point to this line rather than the caller's code that passed the deprecated mask_modloss argument.

🔎 Proposed fix
         if mask_modloss is not None:
             warnings.warn(
                 "mask_modloss is deprecated and will be removed in the future. To mask the modloss fragments, "
-                "the charged_frag_types should not include the modloss fragments."
+                "the charged_frag_types should not include the modloss fragments.",
+                stacklevel=2,
             )
quantmsrescore/__init__.py (2)

82-140: Consider using threadpoolctl for cleaner thread management.

The current manual environment variable approach works, but threadpoolctl provides a more maintainable solution. It's recommended by NumPy and would simplify this function significantly.

Based on past discussion in this PR, the team has already identified threadpoolctl as a preferable approach. Consider migrating to it in a follow-up.

Example using threadpoolctl
def configure_threading(n_threads: Optional[int] = None, verbose: bool = False) -> None:
    """Configure thread counts using threadpoolctl."""
    if n_threads is None:
        n_threads = _DEFAULT_THREADS_PER_PROCESS
    
    try:
        from threadpoolctl import threadpool_limits
        # This context manager can be used globally or per-operation
        threadpool_limits(limits=n_threads, user_api='blas')
        threadpool_limits(limits=n_threads, user_api='openmp')
    except ImportError:
        # Fallback to current env var approach
        pass
    
    # TensorFlow specific settings (not covered by threadpoolctl)
    os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

Note: threadpoolctl needs to be added to dependencies. See https://pypi.org/project/threadpoolctl/


228-234: Optional: Consider sorting __all__ alphabetically.

The current grouping is logical (threading functions, then helper functions, then metadata), but alphabetical sorting can improve consistency.

Alphabetically sorted version
 __all__ = [
+    "__version__",
+    "calculate_optimal_parallelism",
     "configure_threading",
     "configure_torch_threads",
-    "calculate_optimal_parallelism",
     "get_safe_process_count",
-    "__version__",
 ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b75a5f and 20fbd00.

📒 Files selected for processing (2)
  • quantmsrescore/__init__.py
  • quantmsrescore/ms2_model_manager.py
🧰 Additional context used
🪛 Ruff (0.14.10)
quantmsrescore/__init__.py

10-10: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)


228-234: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

quantmsrescore/ms2_model_manager.py

110-110: Avoid specifying long messages outside the exception class

(TRY003)


114-114: Avoid specifying long messages outside the exception class

(TRY003)


123-123: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


139-143: Avoid specifying long messages outside the exception class

(TRY003)


363-363: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


372-372: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.11)
  • GitHub Check: Build and Push QuantMS-Rescoring Docker Images
🔇 Additional comments (12)
quantmsrescore/ms2_model_manager.py (8)

1-20: LGTM!

The new imports (urllib, ssl, certifi, shutil) are appropriate for implementing secure streaming downloads with proper SSL certificate handling.


22-56: LGTM!

The opt-in approach via QUANTMS_HPC_MODE environment variable addresses the previous concern about automatic thread configuration at import time. The public configure_torch_for_hpc() function allows explicit control when needed.


59-87: LGTM!

The initialization logic correctly handles both scenarios: loading external models when present, or downloading and loading the default pretrained models. The default model_dir="." is a reasonable choice for the current working directory.


126-143: LGTM!

Good defensive programming: cleaning up partial downloads on failure and validating the downloaded file is a valid zip before proceeding.
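Putting the pieces together, here is a self-contained sketch of the download path with cleanup-on-failure and zip validation, plus the empty-dirname guard suggested further down; it approximates, but is not identical to, the _download_models implementation.

```python
import os
import shutil
import ssl
import urllib.request
import zipfile

import certifi

def download_model_zip(url: str, dest: str, timeout: int = 300) -> None:
    """Stream `url` to `dest`; remove partial files on failure and reject non-zip payloads."""
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)  # guard against an empty dirname
    context = ssl.create_default_context(cafile=certifi.where())
    try:
        with urllib.request.urlopen(url, context=context, timeout=timeout) as resp, \
                open(dest, "wb") as out:
            shutil.copyfileobj(resp, out, length=1024 * 1024)  # 1 MB chunks
    except Exception:
        if os.path.exists(dest):
            os.remove(dest)  # never leave a truncated archive behind
        raise
    if not zipfile.is_zipfile(dest):  # stand-in for peptdeep's is_model_zip
        os.remove(dest)
        raise ValueError(f"Downloaded file is not a valid model zip: {dest}")
```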


166-185: LGTM!

The method correctly loads all model types (MS2, RT, CCS, Charge) from the downloaded zip file. The previous concern about the unused model_type parameter has been addressed by removing it.


338-351: LGTM!

The fix for the default argument issue (Ruff B008) is correct. Computing charged_frag_types inside the method when None is passed ensures it's evaluated at call time rather than definition time.


368-394: LGTM!

The **kwargs parameter is retained for interface compatibility with the parent class. The batch prediction logic correctly handles normalization, masking, and both ordered and sliced updates.


420-464: LGTM!

The function correctly handles updating DataFrame slices with appropriate optimizations: fast numpy slicing when all columns are updated, and iloc with specific column indices when only certain fragment types need updating.

quantmsrescore/__init__.py (4)

24-79: LGTM!

The resource calculation logic correctly balances CPU and memory constraints for HPC environments. The conservative default of 1 thread per process prevents the thread explosion issue described in the header comments.
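A rough sketch of the CPU-and-memory capping idea, with mem_per_process_gb as an assumed tunable and psutil as an optional dependency (the real calculate_optimal_parallelism may weigh resources differently):

```python
import os

def calculate_optimal_parallelism(mem_per_process_gb: float = 4.0) -> int:
    """Cap the process count by both available CPUs and available memory."""
    cpu_count = os.cpu_count() or 1
    try:
        import psutil
        available_gb = psutil.virtual_memory().available / 1024 ** 3
        mem_limited = max(1, int(available_gb // mem_per_process_gb))
    except ImportError:
        mem_limited = cpu_count  # fall back to a CPU-only limit
    return max(1, min(cpu_count, mem_limited))
```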


142-165: LGTM!

PyTorch thread configuration is implemented correctly with proper error handling for both missing PyTorch installation and already-configured threads.


168-198: LGTM!

Resource-aware process calculation with appropriate fallback when psutil is unavailable. The conservative approach of returning at least 1 process ensures the function never blocks execution.


201-216: LGTM!

The opt-in automatic configuration via QUANTMS_HPC_MODE is well-designed. The comments clearly explain both the automatic and explicit control approaches, and the CLI commands handle threading explicitly regardless of this setting.
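A minimal sketch of the opt-in pattern, assuming the environment variable is checked once at import or CLI start; the actual configure_threading covers more libraries and settings.

```python
import os

def maybe_configure_for_hpc() -> None:
    """Apply conservative threading defaults only when the user opts in."""
    if os.environ.get("QUANTMS_HPC_MODE", "").lower() in ("1", "true", "yes"):
        os.environ.setdefault("OMP_NUM_THREADS", "1")
        os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
        os.environ.setdefault("MKL_NUM_THREADS", "1")

# Typical usage on a cluster node: export QUANTMS_HPC_MODE=1 before launching the CLI.
```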

Comment on lines +117 to +125
logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
try:
os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
context = ssl.create_default_context(cafile=certifi.where())
# Use streaming download with longer timeout for large model files
# timeout=300s (5 min) for slow connections; stream in 1MB chunks
with urllib.request.urlopen(url, context=context, timeout=300) as response: # nosec B310
with open(model_zip_file_path, "wb") as out_file:
shutil.copyfileobj(response, out_file, length=1024 * 1024) # 1MB chunks
Contributor

⚠️ Potential issue | 🟡 Minor

Guard against empty dirname before calling os.makedirs.

When model_zip_file_path is a simple filename without a directory component (e.g., "pretrained_models_v3.zip"), os.path.dirname() returns an empty string. Calling os.makedirs("", exist_ok=True) can raise FileNotFoundError on some platforms.

🔎 Proposed fix
             logging.info(f"Downloading pretrained models from {url} to {model_zip_file_path} ...")
             try:
-                os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True)
+                parent_dir = os.path.dirname(model_zip_file_path)
+                if parent_dir:
+                    os.makedirs(parent_dir, exist_ok=True)
                 context = ssl.create_default_context(cafile=certifi.where())
🧰 Tools
🪛 Ruff (0.14.10)

123-123: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
In quantmsrescore/ms2_model_manager.py around lines 117 to 125, calling
os.makedirs(os.path.dirname(model_zip_file_path), exist_ok=True) can fail when
model_zip_file_path has no directory component (dirname == ""), so guard against
an empty dirname: compute dirpath = os.path.dirname(model_zip_file_path) and
only call os.makedirs(dirpath, exist_ok=True) if dirpath is non-empty (or
alternatively replace empty dirname with "." to mean current directory); keep
the rest of the download logic unchanged.

@ypriverol ypriverol merged commit 4178554 into main Jan 4, 2026
5 checks passed