
refactor: consolidate utilities and improve type safety#20

Open
razvanazamfirei wants to merge 2 commits into main from code-cleanup

Conversation

razvanazamfirei (Owner) commented Apr 16, 2026

Summary by CodeRabbit

  • New Features

    • Environment-variable ML model path support and new processor configuration options for process-pool chunking.
    • Added a public typing marker to enable packaged type information.
  • Improvements

    • Unified text-cleaning and missing-value detection across the codebase.
    • Tunable cache sizing and standardized confidence constants for ML/rule logic.
    • Tightened typing on processor inputs/outputs and improved logging behavior.
  • Documentation

    • Added developer guidelines covering repository conventions, tooling, and testing.
  • Tests

    • Adjusted log-level expectations in processor tests.

- Centralize is_missing_scalar() in types.py and clean_text() in utils.py
- Add named confidence constants in hybrid.py (replaces hardcoded 0.8/0.85/1.0)
- Add LRU_CACHE_SIZE constant to avoid magic numbers
- Add py.typed marker for PEP 561 compliance
- Remove unnecessary pass statements from exception classes
- Improve processor chunk configuration via instance parameters
- Enhance logging error messages with context
Copilot AI review requested due to automatic review settings April 16, 2026 02:30
codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 94.79167% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.85%. Comparing base (037d065) to head (4d7eb9d).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/case_parser/ml/loader.py 60.00% 2 Missing ⚠️
src/case_parser/logging_config.py 50.00% 1 Missing ⚠️
src/case_parser/ml/hybrid.py 75.00% 0 Missing and 1 partial ⚠️
src/case_parser/ml/inputs.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #20      +/-   ##
==========================================
- Coverage   82.78%   78.85%   -3.93%     
==========================================
  Files          31       31              
  Lines        2219     2346     +127     
  Branches      325      347      +22     
==========================================
+ Hits         1837     1850      +13     
- Misses        311      426     +115     
+ Partials       71       70       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copilot AI left a comment

Pull request overview

This PR refactors case_parser to centralize common utilities (missing-value checks, text normalization, cache sizing), tighten type annotations across the processor/ML pipeline, and improve configurability of ML model loading.

Changes:

  • Consolidates “missing scalar” checks and text normalization into shared helpers (is_missing_scalar, clean_text) and standardizes lru_cache sizing via LRU_CACHE_SIZE.
  • Improves type annotations throughout CaseProcessor and extraction/ML helpers, and adds a py.typed marker for PEP 561 typing support.
  • Adds CASE_PARSER_MODEL_PATH environment variable support for selecting the ML model file at runtime.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_enhanced_processor.py Updates log expectations for process-chunk fallback behavior.
src/case_parser/utils.py Introduces shared LRU_CACHE_SIZE and clean_text() helper.
src/case_parser/types.py Adds is_missing_scalar() helper and related imports.
src/case_parser/py.typed Marks package as typed (PEP 561).
src/case_parser/processor.py Refactors missing checks/types, adds configurable process-pool thresholds, and refines fallback logging.
src/case_parser/patterns/categorization.py Reuses shared missing-scalar helper and centralized cache size constant.
src/case_parser/patterns/block_site_patterns.py Replaces local text cleaning with shared clean_text().
src/case_parser/patterns/approach_patterns.py Uses shared LRU_CACHE_SIZE for cached detectors.
src/case_parser/ml/loader.py Adds env-var based ML model path resolution.
src/case_parser/ml/inputs.py Switches to shared text cleaning and adjusts feature input normalization.
src/case_parser/ml/hybrid.py Replaces magic numbers with named constants and improves Protocol docstrings.
src/case_parser/ml/features.py Uses shared LRU_CACHE_SIZE for cached feature extraction.
src/case_parser/logging_config.py Replaces getattr lookup with explicit log-level map.
src/case_parser/extractors.py Reuses shared missing-scalar logic and tightens input types.
src/case_parser/exceptions.py Simplifies exception definitions (removes redundant pass).
AGENTS.md Adds contributor guidance (commands, architecture, typing/testing conventions).
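The configurable process-pool thresholds mentioned for processor.py above could work roughly like this sketch. The parameter names come from the review summaries; the chunking arithmetic and default values are assumptions:

```python
def compute_chunks(
    n_rows: int,
    process_pool_min_rows: int = 500,       # assumed default
    process_pool_target_chunk_rows: int = 200,  # assumed default
) -> list[range]:
    """Split row indices into process-pool chunks, or keep one in-process chunk.

    Below the minimum-row threshold the process-pool overhead is not worth
    paying, so everything stays in a single chunk.
    """
    if n_rows < process_pool_min_rows:
        return [range(0, n_rows)] if n_rows else []
    return [
        range(start, min(start + process_pool_target_chunk_rows, n_rows))
        for start in range(0, n_rows, process_pool_target_chunk_rows)
    ]
```

Exposing both knobs on the constructor, as the PR does, lets callers tune the parallelism trade-off per workload instead of relying on hardcoded thresholds.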
Comments suppressed due to low confidence (1)

src/case_parser/processor.py:813

  • These row/value annotations assume every DataFrame cell is Scalar, but row.get(self.column_map.date) commonly yields pd.Timestamp/datetime values (and the implementation explicitly handles them via pd.to_datetime / _normalize_timestamp_to_utc). This mismatch undermines the PR’s type-safety goal. Either widen the row/value types here (e.g., to object / a broader alias) or expand Scalar so the annotations reflect the actual inputs.
    def _prepare_rows(
        self, rows: Sequence[Mapping[Hashable, Scalar]]
    ) -> list[_PreparedRow]:
        """Precompute per-row metadata for a dataframe batch.

        This batches date parsing and hybrid categorization so downstream
        row-processing can reuse normalized timestamps, services, categories,
        and warning lists.
        """
        date_preparations = self._prepare_dates([
            row.get(self.column_map.date) for row in rows
        ])


Comment thread src/case_parser/types.py
Comment on lines 11 to 14
# Scalar values from pandas DataFrames, Excel cells, or dict lookups.
# pd.NA/pd.NaT are handled via pd.isna() at runtime.
type Scalar = str | int | float | bool | None

Copilot AI commented Apr 16, 2026

Scalar is defined as only primitive types, but downstream code (e.g., date parsing in CaseProcessor) explicitly supports pd.Timestamp / datetime values from DataFrames. Keeping Scalar = str | int | float | bool | None makes the new type annotations misleading and will force type: ignore / incorrect narrowing elsewhere. Consider broadening Scalar (or introducing a separate alias like CellValue) to include datetime and pd.Timestamp (and possibly other common pandas scalar types) so annotations match actual runtime inputs.

Comment thread src/case_parser/ml/inputs.py
Comment on lines 94 to 127

     def build_feature_inputs(
         procedure_texts: Sequence[Scalar],
         services_list: Sequence[Scalar | list[Scalar]] | None = None,
         rule_categories: Sequence[Scalar] | None = None,
         rule_warning_counts: Sequence[Scalar] | None = None,
     ) -> list[FeatureInput]:
         """Build batched FeatureInput objects from parallel procedure metadata."""
         expected_length = len(procedure_texts)
         services_seq = _normalize_parallel_values(
             name="services_list",
             values=services_list,
             default_value="",
             expected_length=expected_length,
         )
         categories_seq = _normalize_parallel_values(
             name="rule_categories",
             values=rule_categories,
             default_value="",
             expected_length=expected_length,
         )
         warnings_seq = _normalize_parallel_values(
             name="rule_warning_counts",
             values=rule_warning_counts,
             default_value=0,
             expected_length=expected_length,
         )

         return [
             FeatureInput(
    -            procedure_text=coerce_text(procedure_text),
    +            procedure_text=clean_text(procedure_text),
                 service_text=coerce_service_text(service_text),
    -            rule_category=coerce_text(rule_category),
    -            rule_warning_count=parse_int(rule_warning_count),
    +            rule_category=clean_text(rule_category),  # type: ignore[arg-type]
    +            rule_warning_count=parse_int(rule_warning_count),  # type: ignore[arg-type]
             )
Copilot AI commented Apr 16, 2026

build_feature_inputs() currently needs # type: ignore[arg-type] because _normalize_parallel_values() returns Sequence[Scalar | list[Scalar]] for all parallel sequences, even when a given input (like rule_categories / rule_warning_counts) can never contain lists. This is avoidable and weakens the type-safety refactor. Consider making _normalize_parallel_values() generic (preserve the element type of values/default_value) or splitting into separate helpers so rule_category and rule_warning_count have precise scalar types and the ignores can be removed.

Comment thread src/case_parser/logging_config.py Outdated
Comment on lines 7 to 11

    from typing import TYPE_CHECKING, Literal

    if TYPE_CHECKING:
        pass

Copilot AI commented Apr 16, 2026

TYPE_CHECKING is imported but the guarded block is just pass, so it adds noise without providing any typing-only imports. Suggest removing the TYPE_CHECKING import and the empty guard block (or add the intended type-only imports) to keep this module minimal.

Suggested change

    -from typing import TYPE_CHECKING, Literal
    -
    -if TYPE_CHECKING:
    -    pass
    +from typing import Literal
coderabbitai Bot commented Apr 16, 2026

Walkthrough

Centralizes utilities for text cleaning, missing-value detection, and LRU cache sizing; adds PEP 561 type marker and AGENTS.md docs; makes ML model path configurable via environment; introduces typed changes and configurable process-pool chunking in the processor; removes duplicated helpers and tightens type signatures and logging behavior.

Changes

Cohort / File(s): Summary

  • Documentation & Type Marker (AGENTS.md, src/case_parser/py.typed): Adds repository developer documentation and a PEP 561 marker file.
  • Shared Types & Utilities (src/case_parser/types.py, src/case_parser/utils.py): Adds/extends the Scalar alias, introduces is_missing_scalar(...) and clean_text(...), and adds LRU_CACHE_SIZE and missing-text sentinel sets.
  • Extractor & Pattern Updates (src/case_parser/extractors.py, src/case_parser/patterns/.../approach_patterns.py, src/case_parser/patterns/.../block_site_patterns.py, src/case_parser/patterns/.../categorization.py): Removes local missing/clean helpers and delegates to is_missing_scalar/clean_text; replaces hardcoded @lru_cache sizes with LRU_CACHE_SIZE; updates the clean_names() signature.
  • ML Code Changes (src/case_parser/ml/features.py, src/case_parser/ml/hybrid.py, src/case_parser/ml/inputs.py, src/case_parser/ml/loader.py): Uses LRU_CACHE_SIZE for cached functions, centralizes rule/ML confidence thresholds as constants, routes text normalization through clean_text, and makes the model path selectable via CASE_PARSER_MODEL_PATH.
  • Processor Core (src/case_parser/processor.py): Adds an _EPISODE_ID_MAX_LENGTH constant and configurable process_pool_min_rows / process_pool_target_chunk_rows constructor parameters, tightens typed representations and method signatures, replaces ad-hoc missing checks with is_missing_scalar, and adjusts the process-pool chunking flow and error logging.
  • Logging & Exceptions (src/case_parser/logging_config.py, src/case_parser/exceptions.py): Replaces the dynamic logging-level lookup with a static _LOG_LEVELS mapping; removes a redundant pass and __future__ import in exceptions. Public signatures unchanged.
  • Tests (tests/test_enhanced_processor.py): Lowers caplog contexts from ERROR to WARNING in three tests and updates assertions to reflect fallback logging expectations.
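The static log-level mapping described for logging_config.py might look like this sketch. The exact keys, private name, and fallback behavior are assumptions:

```python
import logging

# Static mapping in place of a dynamic getattr(logging, name) lookup; the
# real _LOG_LEVELS may cover a different set of names.
_LOG_LEVELS: dict[str, int] = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_level(name: str) -> int:
    """Look up a level by name, falling back to INFO for unknown values."""
    return _LOG_LEVELS.get(name.upper(), logging.INFO)
```

An explicit map keeps the accepted level names enumerable and avoids `getattr` resolving arbitrary attributes of the logging module.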
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor: consolidate utilities and improve type safety' accurately reflects the main changes: consolidating duplicate utilities (e.g., clean_text, is_missing_scalar) across modules and enhancing type annotations throughout the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 93.33% which is sufficient. The required threshold is 80.00%.


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/case_parser/patterns/categorization.py (1)

375-379: ⚠️ Potential issue | 🟠 Major

Replace remaining pd.isna null check with is_missing_scalar.

Line 377 still uses pd.isna, which leaves null-check behavior inconsistent within this module after the shared-helper migration.

✅ Proposed fix
 def _normalize_procedure_text(procedure: str | None) -> str:
     """Normalize optional procedure text to uppercase for caching."""
-    if pd.isna(procedure):
+    if is_missing_scalar(procedure):
         return ""
     return str(procedure).upper()
As per coding guidelines, "Use `is_missing_scalar()` function from `types.py` for null checks instead of manual `pd.isna()` calls".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/patterns/categorization.py` around lines 375 - 379, In
_normalize_procedure_text, replace the pd.isna(procedure) null check with the
shared helper is_missing_scalar from types.py: call is_missing_scalar(procedure)
and return "" for missing values; also add the appropriate import for
is_missing_scalar so the function compiles and behavior remains consistent with
other null checks in this module.
src/case_parser/processor.py (1)

357-359: 🧹 Nitpick | 🔵 Trivial

Inconsistent null check pattern.

Line 357 uses pd.isna(value) while the coding guidelines specify using is_missing_scalar() for null checks. For consistency with lines 241 and 286, consider using is_missing_scalar(value) here as well.

♻️ Proposed fix
     @staticmethod
     def normalize_emergent_flag(value: Scalar) -> bool:
-        if pd.isna(value):
+        if is_missing_scalar(value):
             return False
         return str(value).strip().upper() in {"E", "Y", "YES", "TRUE", "1"}

As per coding guidelines: "Use is_missing_scalar() function from types.py for null checks instead of manual pd.isna() calls"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/processor.py` around lines 357 - 359, The null check uses
pd.isna(value); replace it with the standard helper is_missing_scalar(value) to
match the pattern used on lines 241 and 286 (and to follow coding guidelines).
Update the conditional in the function that checks boolean-like strings (the
block that returns False for missing values and then checks
str(value).strip().upper()) to call is_missing_scalar(value) instead of
pd.isna(value), and ensure is_missing_scalar is imported from types.py if not
already.
src/case_parser/ml/hybrid.py (1)

303-313: 🛠️ Refactor suggestion | 🟠 Major

Inconsistent confidence value usage.

Lines 307, 362, and 400 still use hardcoded 0.8 if rule_warnings else 1.0 instead of the new RULES_CONFIDENCE_WITH_WARNINGS / RULES_CONFIDENCE_WITHOUT_WARNINGS constants. This creates inconsistency with lines 219-221 and 283-285 which correctly use the constants.

♻️ Proposed fix to use constants consistently
         # If ML returns invalid category, fall back to rules
         return ClassificationResult(
             category=rule_category,
             method="rules",
-            confidence=0.8 if rule_warnings else 1.0,
+            confidence=RULES_CONFIDENCE_WITH_WARNINGS
+            if rule_warnings
+            else RULES_CONFIDENCE_WITHOUT_WARNINGS,
             alternative=None,
             warnings=[
                 *rule_warnings,
                 f"ML returned invalid category: {ml_category_str}",
             ],
         )
         return ClassificationResult(
             category=rule_category,
             method="rules",
-            confidence=0.8 if rule_warnings else 1.0,
+            confidence=RULES_CONFIDENCE_WITH_WARNINGS
+            if rule_warnings
+            else RULES_CONFIDENCE_WITHOUT_WARNINGS,
             alternative=ml_category,
             warnings=warnings,
         )
     # Low confidence ML - use rules only
     return ClassificationResult(
         category=rule_category,
         method="rules",
-        confidence=0.8 if rule_warnings else 1.0,
+        confidence=RULES_CONFIDENCE_WITH_WARNINGS
+        if rule_warnings
+        else RULES_CONFIDENCE_WITHOUT_WARNINGS,
         alternative=None,
         warnings=rule_warnings,
     )

Also applies to: 359-365, 396-403

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/ml/hybrid.py` around lines 303 - 313, Replace the hardcoded
confidence expression "0.8 if rule_warnings else 1.0" used when returning a
ClassificationResult with the project constants RULES_CONFIDENCE_WITH_WARNINGS /
RULES_CONFIDENCE_WITHOUT_WARNINGS; specifically update the return sites that
build ClassificationResult (the branches that fall back to rules when ML is
invalid or similar) to use "RULES_CONFIDENCE_WITH_WARNINGS if rule_warnings else
RULES_CONFIDENCE_WITHOUT_WARNINGS" so confidence is consistent with the other
usages; ensure you update all occurrences where ClassificationResult is created
with the hardcoded ternary (references: ClassificationResult, rule_warnings,
ml_category_str).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/case_parser/logging_config.py`:
- Around line 9-10: Remove the dead if TYPE_CHECKING: pass block from
logging_config.py; delete the block and also remove the now-unused TYPE_CHECKING
import (if present) so there is no empty conditional or unused symbol in the
module; ensure no other code relies on TYPE_CHECKING before committing the
change.

In `@src/case_parser/ml/inputs.py`:
- Around line 125-126: The type ignores on the calls to
clean_text(rule_category) and parse_int(rule_warning_count) arise because
_normalize_parallel_values can return Sequence[Scalar | list[Scalar]] rather
than a single Scalar; update _normalize_parallel_values to narrow its return
type to Scalar where appropriate (or add an explicit runtime assertion/cast
after calling _normalize_parallel_values) so callers like clean_text and
parse_int receive a confirmed Scalar without needing "# type: ignore[arg-type]";
specifically, adjust the implementation of _normalize_parallel_values (or wrap
its result before passing into clean_text and parse_int) to assert/convert to
the expected Scalar type for rule_category and rule_warning_count and remove the
type ignores.

In `@src/case_parser/utils.py`:
- Around line 52-53: The sentinel check using "if text in
_MISSING_TEXT_SENTINELS" is case-sensitive; update the logic (in the function
containing that check in src/case_parser/utils.py) to perform a case-insensitive
comparison by normalizing the input and sentinels (e.g., compute key =
text.strip().strip('<>').casefold() and compare against a precomputed set of
casefolded sentinels like {s.casefold().strip('<>') for s in
_MISSING_TEXT_SENTINELS}); replace the original membership test with "if key in
normalized_missing_sentinels: return ''" so values like "NAN", "None", or "<na>"
are recognized.

---

Outside diff comments:
In `@src/case_parser/ml/hybrid.py`:
- Around line 303-313: Replace the hardcoded confidence expression "0.8 if
rule_warnings else 1.0" used when returning a ClassificationResult with the
project constants RULES_CONFIDENCE_WITH_WARNINGS /
RULES_CONFIDENCE_WITHOUT_WARNINGS; specifically update the return sites that
build ClassificationResult (the branches that fall back to rules when ML is
invalid or similar) to use "RULES_CONFIDENCE_WITH_WARNINGS if rule_warnings else
RULES_CONFIDENCE_WITHOUT_WARNINGS" so confidence is consistent with the other
usages; ensure you update all occurrences where ClassificationResult is created
with the hardcoded ternary (references: ClassificationResult, rule_warnings,
ml_category_str).

In `@src/case_parser/patterns/categorization.py`:
- Around line 375-379: In _normalize_procedure_text, replace the
pd.isna(procedure) null check with the shared helper is_missing_scalar from
types.py: call is_missing_scalar(procedure) and return "" for missing values;
also add the appropriate import for is_missing_scalar so the function compiles
and behavior remains consistent with other null checks in this module.

In `@src/case_parser/processor.py`:
- Around line 357-359: The null check uses pd.isna(value); replace it with the
standard helper is_missing_scalar(value) to match the pattern used on lines 241
and 286 (and to follow coding guidelines). Update the conditional in the
function that checks boolean-like strings (the block that returns False for
missing values and then checks str(value).strip().upper()) to call
is_missing_scalar(value) instead of pd.isna(value), and ensure is_missing_scalar
is imported from types.py if not already.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8186381a-93c3-4795-969f-513d9c626383

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4ebf6 and 4a7f0ad.

📒 Files selected for processing (16)
  • AGENTS.md
  • src/case_parser/exceptions.py
  • src/case_parser/extractors.py
  • src/case_parser/logging_config.py
  • src/case_parser/ml/features.py
  • src/case_parser/ml/hybrid.py
  • src/case_parser/ml/inputs.py
  • src/case_parser/ml/loader.py
  • src/case_parser/patterns/approach_patterns.py
  • src/case_parser/patterns/block_site_patterns.py
  • src/case_parser/patterns/categorization.py
  • src/case_parser/processor.py
  • src/case_parser/py.typed
  • src/case_parser/types.py
  • src/case_parser/utils.py
  • tests/test_enhanced_processor.py
💤 Files with no reviewable changes (1)
  • src/case_parser/exceptions.py

Comment thread src/case_parser/logging_config.py Outdated
Comment thread src/case_parser/ml/inputs.py Outdated
Comment thread src/case_parser/utils.py Outdated

coderabbitai bot left a comment
Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/case_parser/processor.py (1)

691-698: 🧹 Nitpick | 🔵 Trivial

Consider using is_missing_scalar for consistency.

Line 691 uses pd.notna(metadata.procedure_text) while the rest of the file has migrated to is_missing_scalar(). For consistency, consider:

if not is_missing_scalar(metadata.procedure_text) and str(metadata.procedure_text).strip():

However, since metadata.procedure_text is already typed as str | None (not a raw scalar from DataFrame), and the str(...).strip() check handles None safely, the current approach is functionally correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/processor.py` around lines 691 - 698, The code uses
pd.notna(metadata.procedure_text) at the start of the procedure block which is
inconsistent with the rest of the file's use of is_missing_scalar; replace the
pd.notna check with a guard using is_missing_scalar (i.e., if not
is_missing_scalar(metadata.procedure_text) and
str(metadata.procedure_text).strip():) to match conventions around
scalar/missing checks for metadata.procedure_text while leaving the subsequent
calls to extract_monitoring, the loop that appends into monitoring, and the call
to self._extend_findings(all_findings, confidence_scores) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/case_parser/ml/inputs.py`:
- Around line 17-18: Remove the now-unused TypeVar declaration T = TypeVar("T",
bound=Scalar | list[Scalar]) since _normalize_parallel_values uses the PEP 695
inline type parameter syntax; delete that line and tidy up any related unused
imports (e.g., TypeVar) in the module to avoid lints and unused-symbol warnings.

In `@src/case_parser/utils.py`:
- Around line 41-55: clean_text currently returns the string "NaT" for pandas
NaT because "NaT" isn't in the sentinel set; update the sentinel normalization
so pandas' NaT is treated as missing by adding the lowercased "nat" (or "NaT")
to the sentinel collection referenced by clean_text (e.g., _NORMALIZED_SENTINELS
/ _MISSING_TEXT_SENTINELS) or alternatively detect pandas.NaT before
stringifying in clean_text; ensure the function uses the same sentinel constant
(_NORMALIZED_SENTINELS) and includes "nat" (case-insensitive) so "NaT" is
normalized to an empty string.

---

Outside diff comments:
In `@src/case_parser/processor.py`:
- Around line 691-698: The code uses pd.notna(metadata.procedure_text) at the
start of the procedure block which is inconsistent with the rest of the file's
use of is_missing_scalar; replace the pd.notna check with a guard using
is_missing_scalar (i.e., if not is_missing_scalar(metadata.procedure_text) and
str(metadata.procedure_text).strip():) to match conventions around
scalar/missing checks for metadata.procedure_text while leaving the subsequent
calls to extract_monitoring, the loop that appends into monitoring, and the call
to self._extend_findings(all_findings, confidence_scores) unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d6c650aa-d80c-4786-b6b5-1493173d17d1

📥 Commits

Reviewing files that changed from the base of the PR and between 4a7f0ad and 4d7eb9d.

📒 Files selected for processing (7)
  • src/case_parser/logging_config.py
  • src/case_parser/ml/hybrid.py
  • src/case_parser/ml/inputs.py
  • src/case_parser/patterns/categorization.py
  • src/case_parser/processor.py
  • src/case_parser/types.py
  • src/case_parser/utils.py

Comment on lines +17 to +18
T = TypeVar("T", bound=Scalar | list[Scalar])


🧹 Nitpick | 🔵 Trivial

Unused TypeVar definition after PEP 695 migration.

The T = TypeVar("T", bound=Scalar | list[Scalar]) on line 17 appears to be dead code. The function _normalize_parallel_values at line 157 now uses PEP 695 inline type parameter syntax [T: Scalar | list[Scalar]], making this explicit TypeVar definition unnecessary.

🧹 Proposed cleanup
-from typing import TYPE_CHECKING, TypeVar
+from typing import TYPE_CHECKING

 from ..types import Scalar
 from ..utils import clean_text
 from .config import SERVICE_COLUMN_CANDIDATES

 if TYPE_CHECKING:
     from pandas import DataFrame
-
-
-T = TypeVar("T", bound=Scalar | list[Scalar])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/ml/inputs.py` around lines 17 - 18, Remove the now-unused
TypeVar declaration T = TypeVar("T", bound=Scalar | list[Scalar]) since
_normalize_parallel_values uses the PEP 695 inline type parameter syntax; delete
that line and tidy up any related unused imports (e.g., TypeVar) in the module
to avoid lints and unused-symbol warnings.

Comment thread src/case_parser/utils.py
Comment on lines +41 to +55
    def clean_text(value: Scalar | None) -> str:
        """Normalize text values to a plain string or empty string.

        Args:
            value: Any scalar value or None.

        Returns:
            Stripped string with missing-value sentinels normalized to empty string.
        """
        if value is None:
            return ""
        text = str(value).strip()
        if text.strip("<>").casefold() in _NORMALIZED_SENTINELS:
            return ""
        return text

⚠️ Potential issue | 🟡 Minor

Missing sentinel for pd.NaT string representation.

When pd.NaT is passed to clean_text(), it stringifies to "NaT", which is not in _MISSING_TEXT_SENTINELS. This could cause "NaT" to be returned as valid text instead of being normalized to "".

💡 Proposed fix
-_MISSING_TEXT_SENTINELS = {"", "<NA>", "nan", "NaN", "None"}
+_MISSING_TEXT_SENTINELS = {"", "<NA>", "nan", "NaN", "None", "NaT"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/utils.py` around lines 41 - 55, clean_text currently returns
the string "NaT" for pandas NaT because "NaT" isn't in the sentinel set; update
the sentinel normalization so pandas' NaT is treated as missing by adding the
lowercased "nat" (or "NaT") to the sentinel collection referenced by clean_text
(e.g., _NORMALIZED_SENTINELS / _MISSING_TEXT_SENTINELS) or alternatively detect
pandas.NaT before stringifying in clean_text; ensure the function uses the same
sentinel constant (_NORMALIZED_SENTINELS) and includes "nat" (case-insensitive)
so "NaT" is normalized to an empty string.
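Combining the two sentinel fixes suggested in these threads (casefolded matching plus a NaT entry) could look like the following sketch, which is illustrative rather than the repository's actual utils.py:

```python
# Sentinel set extended with "NaT" per the review comment; matching is done on
# a precomputed casefolded set so "NAN", "None", "<na>", etc. are recognized.
_MISSING_TEXT_SENTINELS = {"", "<NA>", "nan", "NaN", "None", "NaT"}
_NORMALIZED_SENTINELS = {s.strip("<>").casefold() for s in _MISSING_TEXT_SENTINELS}


def clean_text(value: object) -> str:
    """Normalize a scalar to text, mapping missing-value sentinels to ''."""
    if value is None:
        return ""
    text = str(value).strip()
    if text.strip("<>").casefold() in _NORMALIZED_SENTINELS:
        return ""
    return text
```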

@codacy-production commented

Not up to standards ⛔

🔴 Issues 2 high · 2 medium

Alerts:
⚠ 4 issues (≤ 0 issues of at least minor severity)

Results:
4 new issues

Category breakdown:
  • BestPractice: 1 medium
  • ErrorProne: 2 high, 1 medium

View in Codacy

🟢 Metrics 534 complexity · 4 duplication


View in Codacy

