
refactor: consolidate utilities and improve type safety#20

Open
razvanazamfirei wants to merge 2 commits into main from code-cleanup

Conversation

razvanazamfirei (Owner) commented Apr 16, 2026

Summary by CodeRabbit

  • New Features

    • Environment-variable ML model path support and new processor configuration options for process-pool chunking.
    • Added a public typing marker to enable packaged type information.
  • Improvements

    • Unified text-cleaning and missing-value detection across the codebase.
    • Tunable cache sizing and standardized confidence constants for ML/rule logic.
    • Tightened typing on processor inputs/outputs and improved logging behavior.
  • Documentation

    • Added developer guidelines covering repository conventions, tooling, and testing.
  • Tests

    • Adjusted log-level expectations in processor tests.

- Centralize is_missing_scalar() in types.py and clean_text() in utils.py
- Add named confidence constants in hybrid.py (replaces hardcoded 0.8/0.85/1.0)
- Add LRU_CACHE_SIZE constant to avoid magic numbers
- Add py.typed marker for PEP 561 compliance
- Remove unnecessary pass statements from exception classes
- Improve processor chunk configuration via instance parameters
- Enhance logging error messages with context
Copilot AI review requested due to automatic review settings April 16, 2026 02:30
codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 94.79167% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.85%. Comparing base (037d065) to head (4d7eb9d).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/case_parser/ml/loader.py 60.00% 2 Missing ⚠️
src/case_parser/logging_config.py 50.00% 1 Missing ⚠️
src/case_parser/ml/hybrid.py 75.00% 0 Missing and 1 partial ⚠️
src/case_parser/ml/inputs.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #20      +/-   ##
==========================================
- Coverage   82.78%   78.85%   -3.93%     
==========================================
  Files          31       31              
  Lines        2219     2346     +127     
  Branches      325      347      +22     
==========================================
+ Hits         1837     1850      +13     
- Misses        311      426     +115     
+ Partials       71       70       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copilot AI left a comment

Pull request overview

This PR refactors case_parser to centralize common utilities (missing-value checks, text normalization, cache sizing), tighten type annotations across the processor/ML pipeline, and improve configurability of ML model loading.

Changes:

  • Consolidates “missing scalar” checks and text normalization into shared helpers (is_missing_scalar, clean_text) and standardizes lru_cache sizing via LRU_CACHE_SIZE.
  • Improves type annotations throughout CaseProcessor and extraction/ML helpers, and adds a py.typed marker for PEP 561 typing support.
  • Adds CASE_PARSER_MODEL_PATH environment variable support for selecting the ML model file at runtime.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_enhanced_processor.py Updates log expectations for process-chunk fallback behavior.
src/case_parser/utils.py Introduces shared LRU_CACHE_SIZE and clean_text() helper.
src/case_parser/types.py Adds is_missing_scalar() helper and related imports.
src/case_parser/py.typed Marks package as typed (PEP 561).
src/case_parser/processor.py Refactors missing checks/types, adds configurable process-pool thresholds, and refines fallback logging.
src/case_parser/patterns/categorization.py Reuses shared missing-scalar helper and centralized cache size constant.
src/case_parser/patterns/block_site_patterns.py Replaces local text cleaning with shared clean_text().
src/case_parser/patterns/approach_patterns.py Uses shared LRU_CACHE_SIZE for cached detectors.
src/case_parser/ml/loader.py Adds env-var based ML model path resolution.
src/case_parser/ml/inputs.py Switches to shared text cleaning and adjusts feature input normalization.
src/case_parser/ml/hybrid.py Replaces magic numbers with named constants and improves Protocol docstrings.
src/case_parser/ml/features.py Uses shared LRU_CACHE_SIZE for cached feature extraction.
src/case_parser/logging_config.py Replaces getattr lookup with explicit log-level map.
src/case_parser/extractors.py Reuses shared missing-scalar logic and tightens input types.
src/case_parser/exceptions.py Simplifies exception definitions (removes redundant pass).
AGENTS.md Adds contributor guidance (commands, architecture, typing/testing conventions).
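The configurable process-pool thresholds mentioned for processor.py above could work roughly like this sketch. The parameter names come from the review summaries; the chunking arithmetic and default values are assumptions:

```python
def compute_chunks(
    n_rows: int,
    process_pool_min_rows: int = 500,       # assumed default
    process_pool_target_chunk_rows: int = 200,  # assumed default
) -> list[range]:
    """Split row indices into process-pool chunks, or keep one in-process chunk.

    Below the minimum-row threshold the process-pool overhead is not worth
    paying, so everything stays in a single chunk.
    """
    if n_rows < process_pool_min_rows:
        return [range(0, n_rows)] if n_rows else []
    return [
        range(start, min(start + process_pool_target_chunk_rows, n_rows))
        for start in range(0, n_rows, process_pool_target_chunk_rows)
    ]
```

Exposing both knobs on the constructor, as the PR does, lets callers tune the parallelism trade-off per workload instead of relying on hardcoded thresholds.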
Comments suppressed due to low confidence (1)

src/case_parser/processor.py:813

  • These row/value annotations assume every DataFrame cell is Scalar, but row.get(self.column_map.date) commonly yields pd.Timestamp/datetime values (and the implementation explicitly handles them via pd.to_datetime / _normalize_timestamp_to_utc). This mismatch undermines the PR’s type-safety goal. Either widen the row/value types here (e.g., to object / a broader alias) or expand Scalar so the annotations reflect the actual inputs.
    def _prepare_rows(
        self, rows: Sequence[Mapping[Hashable, Scalar]]
    ) -> list[_PreparedRow]:
        """Precompute per-row metadata for a dataframe batch.

        This batches date parsing and hybrid categorization so downstream
        row-processing can reuse normalized timestamps, services, categories,
        and warning lists.
        """
        date_preparations = self._prepare_dates([
            row.get(self.column_map.date) for row in rows
        ])


Comment thread src/case_parser/types.py
Comment on lines 11 to 14
# Scalar values from pandas DataFrames, Excel cells, or dict lookups.
# pd.NA/pd.NaT are handled via pd.isna() at runtime.
type Scalar = str | int | float | bool | None

Copilot AI commented Apr 16, 2026

Scalar is defined as only primitive types, but downstream code (e.g., date parsing in CaseProcessor) explicitly supports pd.Timestamp / datetime values from DataFrames. Keeping Scalar = str | int | float | bool | None makes the new type annotations misleading and will force type: ignore / incorrect narrowing elsewhere. Consider broadening Scalar (or introducing a separate alias like CellValue) to include datetime and pd.Timestamp (and possibly other common pandas scalar types) so annotations match actual runtime inputs.

Comment thread src/case_parser/ml/inputs.py
Comment on lines 94 to 127

     def build_feature_inputs(
         procedure_texts: Sequence[Scalar],
         services_list: Sequence[Scalar | list[Scalar]] | None = None,
         rule_categories: Sequence[Scalar] | None = None,
         rule_warning_counts: Sequence[Scalar] | None = None,
     ) -> list[FeatureInput]:
         """Build batched FeatureInput objects from parallel procedure metadata."""
         expected_length = len(procedure_texts)
         services_seq = _normalize_parallel_values(
             name="services_list",
             values=services_list,
             default_value="",
             expected_length=expected_length,
         )
         categories_seq = _normalize_parallel_values(
             name="rule_categories",
             values=rule_categories,
             default_value="",
             expected_length=expected_length,
         )
         warnings_seq = _normalize_parallel_values(
             name="rule_warning_counts",
             values=rule_warning_counts,
             default_value=0,
             expected_length=expected_length,
         )

         return [
             FeatureInput(
    -            procedure_text=coerce_text(procedure_text),
    +            procedure_text=clean_text(procedure_text),
                 service_text=coerce_service_text(service_text),
    -            rule_category=coerce_text(rule_category),
    -            rule_warning_count=parse_int(rule_warning_count),
    +            rule_category=clean_text(rule_category),  # type: ignore[arg-type]
    +            rule_warning_count=parse_int(rule_warning_count),  # type: ignore[arg-type]
             )
Copilot AI commented Apr 16, 2026

build_feature_inputs() currently needs # type: ignore[arg-type] because _normalize_parallel_values() returns Sequence[Scalar | list[Scalar]] for all parallel sequences, even when a given input (like rule_categories / rule_warning_counts) can never contain lists. This is avoidable and weakens the type-safety refactor. Consider making _normalize_parallel_values() generic (preserve the element type of values/default_value) or splitting into separate helpers so rule_category and rule_warning_count have precise scalar types and the ignores can be removed.

Comment thread src/case_parser/logging_config.py Outdated
Comment on lines 7 to 11

    from typing import TYPE_CHECKING, Literal

    if TYPE_CHECKING:
        pass

Copilot AI commented Apr 16, 2026

TYPE_CHECKING is imported but the guarded block is just pass, so it adds noise without providing any typing-only imports. Suggest removing the TYPE_CHECKING import and the empty guard block (or add the intended type-only imports) to keep this module minimal.

Suggested change

    -from typing import TYPE_CHECKING, Literal
    -
    -if TYPE_CHECKING:
    -    pass
    +from typing import Literal
coderabbitai Bot commented Apr 16, 2026

Walkthrough

Centralizes utilities for text cleaning, missing-value detection, and LRU cache sizing; adds PEP 561 type marker and AGENTS.md docs; makes ML model path configurable via environment; introduces typed changes and configurable process-pool chunking in the processor; removes duplicated helpers and tightens type signatures and logging behavior.

Changes

Cohort / File(s): Summary

  • Documentation & Type Marker (AGENTS.md, src/case_parser/py.typed): Adds repository developer documentation and a PEP 561 marker file.
  • Shared Types & Utilities (src/case_parser/types.py, src/case_parser/utils.py): Adds/extends the Scalar alias, introduces is_missing_scalar(...) and clean_text(...), and adds LRU_CACHE_SIZE and missing-text sentinel sets.
  • Extractor & Pattern Updates (src/case_parser/extractors.py, src/case_parser/patterns/.../approach_patterns.py, src/case_parser/patterns/.../block_site_patterns.py, src/case_parser/patterns/.../categorization.py): Removes local missing/clean helpers and delegates to is_missing_scalar/clean_text; replaces hardcoded @lru_cache sizes with LRU_CACHE_SIZE; updates the clean_names() signature.
  • ML Code Changes (src/case_parser/ml/features.py, src/case_parser/ml/hybrid.py, src/case_parser/ml/inputs.py, src/case_parser/ml/loader.py): Uses LRU_CACHE_SIZE for cached functions, centralizes rule/ML confidence thresholds as constants, routes text normalization through clean_text, and makes the model path selectable via CASE_PARSER_MODEL_PATH.
  • Processor Core (src/case_parser/processor.py): Adds an _EPISODE_ID_MAX_LENGTH constant and configurable process_pool_min_rows / process_pool_target_chunk_rows constructor parameters, tightens typed representations and method signatures, replaces ad-hoc missing checks with is_missing_scalar, and adjusts the process-pool chunking flow and error logging.
  • Logging & Exceptions (src/case_parser/logging_config.py, src/case_parser/exceptions.py): Replaces the dynamic logging-level lookup with a static _LOG_LEVELS mapping; removes a redundant pass and __future__ import in exceptions. Public signatures unchanged.
  • Tests (tests/test_enhanced_processor.py): Lowers caplog contexts from ERROR to WARNING in three tests and updates assertions to reflect fallback logging expectations.
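The static log-level mapping described for logging_config.py might look like this sketch. The exact keys, private name, and fallback behavior are assumptions:

```python
import logging

# Static mapping in place of a dynamic getattr(logging, name) lookup; the
# real _LOG_LEVELS may cover a different set of names.
_LOG_LEVELS: dict[str, int] = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_level(name: str) -> int:
    """Look up a level by name, falling back to INFO for unknown values."""
    return _LOG_LEVELS.get(name.upper(), logging.INFO)
```

An explicit map keeps the accepted level names enumerable and avoids `getattr` resolving arbitrary attributes of the logging module.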
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor: consolidate utilities and improve type safety' accurately reflects the main changes: consolidating duplicate utilities (e.g., clean_text, is_missing_scalar) across modules and enhancing type annotations throughout the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 93.33% which is sufficient. The required threshold is 80.00%.


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/case_parser/patterns/categorization.py (1)

375-379: ⚠️ Potential issue | 🟠 Major

Replace remaining pd.isna null check with is_missing_scalar.

Line 377 still uses pd.isna, which leaves null-check behavior inconsistent within this module after the shared-helper migration.

✅ Proposed fix
 def _normalize_procedure_text(procedure: str | None) -> str:
     """Normalize optional procedure text to uppercase for caching."""
-    if pd.isna(procedure):
+    if is_missing_scalar(procedure):
         return ""
     return str(procedure).upper()
As per coding guidelines, "Use `is_missing_scalar()` function from `types.py` for null checks instead of manual `pd.isna()` calls".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/patterns/categorization.py` around lines 375 - 379, In
_normalize_procedure_text, replace the pd.isna(procedure) null check with the
shared helper is_missing_scalar from types.py: call is_missing_scalar(procedure)
and return "" for missing values; also add the appropriate import for
is_missing_scalar so the function compiles and behavior remains consistent with
other null checks in this module.
src/case_parser/processor.py (1)

357-359: 🧹 Nitpick | 🔵 Trivial

Inconsistent null check pattern.

Line 357 uses pd.isna(value) while the coding guidelines specify using is_missing_scalar() for null checks. For consistency with lines 241 and 286, consider using is_missing_scalar(value) here as well.

♻️ Proposed fix
     @staticmethod
     def normalize_emergent_flag(value: Scalar) -> bool:
-        if pd.isna(value):
+        if is_missing_scalar(value):
             return False
         return str(value).strip().upper() in {"E", "Y", "YES", "TRUE", "1"}

As per coding guidelines: "Use is_missing_scalar() function from types.py for null checks instead of manual pd.isna() calls"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/processor.py` around lines 357 - 359, The null check uses
pd.isna(value); replace it with the standard helper is_missing_scalar(value) to
match the pattern used on lines 241 and 286 (and to follow coding guidelines).
Update the conditional in the function that checks boolean-like strings (the
block that returns False for missing values and then checks
str(value).strip().upper()) to call is_missing_scalar(value) instead of
pd.isna(value), and ensure is_missing_scalar is imported from types.py if not
already.
src/case_parser/ml/hybrid.py (1)

303-313: 🛠️ Refactor suggestion | 🟠 Major

Inconsistent confidence value usage.

Lines 307, 362, and 400 still use hardcoded 0.8 if rule_warnings else 1.0 instead of the new RULES_CONFIDENCE_WITH_WARNINGS / RULES_CONFIDENCE_WITHOUT_WARNINGS constants. This creates inconsistency with lines 219-221 and 283-285 which correctly use the constants.

♻️ Proposed fix to use constants consistently
         # If ML returns invalid category, fall back to rules
         return ClassificationResult(
             category=rule_category,
             method="rules",
-            confidence=0.8 if rule_warnings else 1.0,
+            confidence=RULES_CONFIDENCE_WITH_WARNINGS
+            if rule_warnings
+            else RULES_CONFIDENCE_WITHOUT_WARNINGS,
             alternative=None,
             warnings=[
                 *rule_warnings,
                 f"ML returned invalid category: {ml_category_str}",
             ],
         )
         return ClassificationResult(
             category=rule_category,
             method="rules",
-            confidence=0.8 if rule_warnings else 1.0,
+            confidence=RULES_CONFIDENCE_WITH_WARNINGS
+            if rule_warnings
+            else RULES_CONFIDENCE_WITHOUT_WARNINGS,
             alternative=ml_category,
             warnings=warnings,
         )
     # Low confidence ML - use rules only
     return ClassificationResult(
         category=rule_category,
         method="rules",
-        confidence=0.8 if rule_warnings else 1.0,
+        confidence=RULES_CONFIDENCE_WITH_WARNINGS
+        if rule_warnings
+        else RULES_CONFIDENCE_WITHOUT_WARNINGS,
         alternative=None,
         warnings=rule_warnings,
     )

Also applies to: 359-365, 396-403

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/ml/hybrid.py` around lines 303 - 313, Replace the hardcoded
confidence expression "0.8 if rule_warnings else 1.0" used when returning a
ClassificationResult with the project constants RULES_CONFIDENCE_WITH_WARNINGS /
RULES_CONFIDENCE_WITHOUT_WARNINGS; specifically update the return sites that
build ClassificationResult (the branches that fall back to rules when ML is
invalid or similar) to use "RULES_CONFIDENCE_WITH_WARNINGS if rule_warnings else
RULES_CONFIDENCE_WITHOUT_WARNINGS" so confidence is consistent with the other
usages; ensure you update all occurrences where ClassificationResult is created
with the hardcoded ternary (references: ClassificationResult, rule_warnings,
ml_category_str).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/case_parser/logging_config.py`:
- Around line 9-10: Remove the dead if TYPE_CHECKING: pass block from
logging_config.py; delete the block and also remove the now-unused TYPE_CHECKING
import (if present) so there is no empty conditional or unused symbol in the
module; ensure no other code relies on TYPE_CHECKING before committing the
change.

In `@src/case_parser/ml/inputs.py`:
- Around line 125-126: The type ignores on the calls to
clean_text(rule_category) and parse_int(rule_warning_count) arise because
_normalize_parallel_values can return Sequence[Scalar | list[Scalar]] rather
than a single Scalar; update _normalize_parallel_values to narrow its return
type to Scalar where appropriate (or add an explicit runtime assertion/cast
after calling _normalize_parallel_values) so callers like clean_text and
parse_int receive a confirmed Scalar without needing "# type: ignore[arg-type]";
specifically, adjust the implementation of _normalize_parallel_values (or wrap
its result before passing into clean_text and parse_int) to assert/convert to
the expected Scalar type for rule_category and rule_warning_count and remove the
type ignores.

In `@src/case_parser/utils.py`:
- Around line 52-53: The sentinel check using "if text in
_MISSING_TEXT_SENTINELS" is case-sensitive; update the logic (in the function
containing that check in src/case_parser/utils.py) to perform a case-insensitive
comparison by normalizing the input and sentinels (e.g., compute key =
text.strip().strip('<>').casefold() and compare against a precomputed set of
casefolded sentinels like {s.casefold().strip('<>') for s in
_MISSING_TEXT_SENTINELS}); replace the original membership test with "if key in
normalized_missing_sentinels: return ''" so values like "NAN", "None", or "<na>"
are recognized.

---

Outside diff comments:
In `@src/case_parser/ml/hybrid.py`:
- Around line 303-313: Replace the hardcoded confidence expression "0.8 if
rule_warnings else 1.0" used when returning a ClassificationResult with the
project constants RULES_CONFIDENCE_WITH_WARNINGS /
RULES_CONFIDENCE_WITHOUT_WARNINGS; specifically update the return sites that
build ClassificationResult (the branches that fall back to rules when ML is
invalid or similar) to use "RULES_CONFIDENCE_WITH_WARNINGS if rule_warnings else
RULES_CONFIDENCE_WITHOUT_WARNINGS" so confidence is consistent with the other
usages; ensure you update all occurrences where ClassificationResult is created
with the hardcoded ternary (references: ClassificationResult, rule_warnings,
ml_category_str).

In `@src/case_parser/patterns/categorization.py`:
- Around line 375-379: In _normalize_procedure_text, replace the
pd.isna(procedure) null check with the shared helper is_missing_scalar from
types.py: call is_missing_scalar(procedure) and return "" for missing values;
also add the appropriate import for is_missing_scalar so the function compiles
and behavior remains consistent with other null checks in this module.

In `@src/case_parser/processor.py`:
- Around line 357-359: The null check uses pd.isna(value); replace it with the
standard helper is_missing_scalar(value) to match the pattern used on lines 241
and 286 (and to follow coding guidelines). Update the conditional in the
function that checks boolean-like strings (the block that returns False for
missing values and then checks str(value).strip().upper()) to call
is_missing_scalar(value) instead of pd.isna(value), and ensure is_missing_scalar
is imported from types.py if not already.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8186381a-93c3-4795-969f-513d9c626383

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4ebf6 and 4a7f0ad.

📒 Files selected for processing (16)
  • AGENTS.md
  • src/case_parser/exceptions.py
  • src/case_parser/extractors.py
  • src/case_parser/logging_config.py
  • src/case_parser/ml/features.py
  • src/case_parser/ml/hybrid.py
  • src/case_parser/ml/inputs.py
  • src/case_parser/ml/loader.py
  • src/case_parser/patterns/approach_patterns.py
  • src/case_parser/patterns/block_site_patterns.py
  • src/case_parser/patterns/categorization.py
  • src/case_parser/processor.py
  • src/case_parser/py.typed
  • src/case_parser/types.py
  • src/case_parser/utils.py
  • tests/test_enhanced_processor.py
💤 Files with no reviewable changes (1)
  • src/case_parser/exceptions.py

Comment thread src/case_parser/logging_config.py Outdated
Comment thread src/case_parser/ml/inputs.py Outdated
Comment thread src/case_parser/utils.py Outdated

coderabbitai bot left a comment
Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/case_parser/processor.py (1)

691-698: 🧹 Nitpick | 🔵 Trivial

Consider using is_missing_scalar for consistency.

Line 691 uses pd.notna(metadata.procedure_text) while the rest of the file has migrated to is_missing_scalar(). For consistency, consider:

if not is_missing_scalar(metadata.procedure_text) and str(metadata.procedure_text).strip():

However, since metadata.procedure_text is already typed as str | None (not a raw scalar from DataFrame), and the str(...).strip() check handles None safely, the current approach is functionally correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/processor.py` around lines 691 - 698, The code uses
pd.notna(metadata.procedure_text) at the start of the procedure block which is
inconsistent with the rest of the file's use of is_missing_scalar; replace the
pd.notna check with a guard using is_missing_scalar (i.e., if not
is_missing_scalar(metadata.procedure_text) and
str(metadata.procedure_text).strip():) to match conventions around
scalar/missing checks for metadata.procedure_text while leaving the subsequent
calls to extract_monitoring, the loop that appends into monitoring, and the call
to self._extend_findings(all_findings, confidence_scores) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/case_parser/ml/inputs.py`:
- Around line 17-18: Remove the now-unused TypeVar declaration T = TypeVar("T",
bound=Scalar | list[Scalar]) since _normalize_parallel_values uses the PEP 695
inline type parameter syntax; delete that line and tidy up any related unused
imports (e.g., TypeVar) in the module to avoid lints and unused-symbol warnings.

In `@src/case_parser/utils.py`:
- Around line 41-55: clean_text currently returns the string "NaT" for pandas
NaT because "NaT" isn't in the sentinel set; update the sentinel normalization
so pandas' NaT is treated as missing by adding the lowercased "nat" (or "NaT")
to the sentinel collection referenced by clean_text (e.g., _NORMALIZED_SENTINELS
/ _MISSING_TEXT_SENTINELS) or alternatively detect pandas.NaT before
stringifying in clean_text; ensure the function uses the same sentinel constant
(_NORMALIZED_SENTINELS) and includes "nat" (case-insensitive) so "NaT" is
normalized to an empty string.

---

Outside diff comments:
In `@src/case_parser/processor.py`:
- Around line 691-698: The code uses pd.notna(metadata.procedure_text) at the
start of the procedure block which is inconsistent with the rest of the file's
use of is_missing_scalar; replace the pd.notna check with a guard using
is_missing_scalar (i.e., if not is_missing_scalar(metadata.procedure_text) and
str(metadata.procedure_text).strip():) to match conventions around
scalar/missing checks for metadata.procedure_text while leaving the subsequent
calls to extract_monitoring, the loop that appends into monitoring, and the call
to self._extend_findings(all_findings, confidence_scores) unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d6c650aa-d80c-4786-b6b5-1493173d17d1

📥 Commits

Reviewing files that changed from the base of the PR and between 4a7f0ad and 4d7eb9d.

📒 Files selected for processing (7)
  • src/case_parser/logging_config.py
  • src/case_parser/ml/hybrid.py
  • src/case_parser/ml/inputs.py
  • src/case_parser/patterns/categorization.py
  • src/case_parser/processor.py
  • src/case_parser/types.py
  • src/case_parser/utils.py

Comment on lines +17 to +18
T = TypeVar("T", bound=Scalar | list[Scalar])


🧹 Nitpick | 🔵 Trivial

Unused TypeVar definition after PEP 695 migration.

The T = TypeVar("T", bound=Scalar | list[Scalar]) on line 17 appears to be dead code. The function _normalize_parallel_values at line 157 now uses PEP 695 inline type parameter syntax [T: Scalar | list[Scalar]], making this explicit TypeVar definition unnecessary.

🧹 Proposed cleanup
-from typing import TYPE_CHECKING, TypeVar
+from typing import TYPE_CHECKING

 from ..types import Scalar
 from ..utils import clean_text
 from .config import SERVICE_COLUMN_CANDIDATES

 if TYPE_CHECKING:
     from pandas import DataFrame
-
-
-T = TypeVar("T", bound=Scalar | list[Scalar])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/ml/inputs.py` around lines 17 - 18, Remove the now-unused
TypeVar declaration T = TypeVar("T", bound=Scalar | list[Scalar]) since
_normalize_parallel_values uses the PEP 695 inline type parameter syntax; delete
that line and tidy up any related unused imports (e.g., TypeVar) in the module
to avoid lints and unused-symbol warnings.

Comment thread src/case_parser/utils.py
Comment on lines +41 to +55
    def clean_text(value: Scalar | None) -> str:
        """Normalize text values to a plain string or empty string.

        Args:
            value: Any scalar value or None.

        Returns:
            Stripped string with missing-value sentinels normalized to empty string.
        """
        if value is None:
            return ""
        text = str(value).strip()
        if text.strip("<>").casefold() in _NORMALIZED_SENTINELS:
            return ""
        return text

⚠️ Potential issue | 🟡 Minor

Missing sentinel for pd.NaT string representation.

When pd.NaT is passed to clean_text(), it stringifies to "NaT", which is not in _MISSING_TEXT_SENTINELS. This could cause "NaT" to be returned as valid text instead of being normalized to "".

💡 Proposed fix
-_MISSING_TEXT_SENTINELS = {"", "<NA>", "nan", "NaN", "None"}
+_MISSING_TEXT_SENTINELS = {"", "<NA>", "nan", "NaN", "None", "NaT"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/case_parser/utils.py` around lines 41 - 55, clean_text currently returns
the string "NaT" for pandas NaT because "NaT" isn't in the sentinel set; update
the sentinel normalization so pandas' NaT is treated as missing by adding the
lowercased "nat" (or "NaT") to the sentinel collection referenced by clean_text
(e.g., _NORMALIZED_SENTINELS / _MISSING_TEXT_SENTINELS) or alternatively detect
pandas.NaT before stringifying in clean_text; ensure the function uses the same
sentinel constant (_NORMALIZED_SENTINELS) and includes "nat" (case-insensitive)
so "NaT" is normalized to an empty string.
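Combining the two sentinel fixes suggested in these threads (casefolded matching plus a NaT entry) could look like the following sketch, which is illustrative rather than the repository's actual utils.py:

```python
# Sentinel set extended with "NaT" per the review comment; matching is done on
# a precomputed casefolded set so "NAN", "None", "<na>", etc. are recognized.
_MISSING_TEXT_SENTINELS = {"", "<NA>", "nan", "NaN", "None", "NaT"}
_NORMALIZED_SENTINELS = {s.strip("<>").casefold() for s in _MISSING_TEXT_SENTINELS}


def clean_text(value: object) -> str:
    """Normalize a scalar to text, mapping missing-value sentinels to ''."""
    if value is None:
        return ""
    text = str(value).strip()
    if text.strip("<>").casefold() in _NORMALIZED_SENTINELS:
        return ""
    return text
```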

@codacy-production commented

Not up to standards ⛔

🔴 Issues 2 high · 2 medium

Alerts:
⚠ 4 issues (≤ 0 issues of at least minor severity)

Results:
4 new issues

Category breakdown:
  • BestPractice: 1 medium
  • ErrorProne: 2 high, 1 medium

View in Codacy

🟢 Metrics 534 complexity · 4 duplication


View in Codacy

