Addition of non-contiguous search and parameterization #116

project-2-2-2 · 2025-10-15T10:15:40Z

This pull request implements the enhancement for non-contiguous pattern matching as we discussed in Issue #115.

Key Changes:

- A `contiguous` parameter has been added to the `search` method. The default is False to align with the canonical GSP algorithm.
- The logic now correctly switches between contiguous and non-contiguous subsequence checking.
- The CLI and acceleration layers have been updated to support the new parameter.
- All project tests are passing.

Closes #115.

Thank you for the opportunity and your guidance. I'm ready to make any further changes needed.

project-2-2-2 · 2025-10-16T15:39:46Z

edit version

jacksonpradolima

initial (quick) review

jacksonpradolima · 2025-11-04T11:36:47Z

gsppy/accelerate.py


-    cp_flat = cp.asarray(flat, dtype=cp.int32)  # type: ignore[name-defined]
-    counts = cp.bincount(cp_flat, minlength=vocab_size)  # type: ignore[attr-defined]
-    counts_host: Any = counts.get()  # back to host as a NumPy array


Why did you remove my comments?

Sorry, sir. I will add them back.

jacksonpradolima · 2025-11-04T11:36:55Z

gsppy/accelerate.py

 def _encode_transactions(transactions: List[Tuple[str, ...]]) -> Tuple[List[List[int]], Dict[int, str], Dict[str, int]]:
    """Encode transactions of strings into integer IDs.
-
+    


Remove the extra whitespace, and check the other places as well.

jacksonpradolima · 2025-11-04T11:37:24Z

gsppy/accelerate.py

 ) -> Dict[Tuple[str, ...], int]:
-    """Pure-Python fallback for support counting (single-process).
-
+    """Pure-Python  fallback for support counting (single-process).


extra spaces again

jacksonpradolima · 2025-11-04T11:37:42Z

gsppy/accelerate.py

 ) -> Dict[Tuple[str, ...], int]:
-    """Choose the best available backend for support counting.
-
+    """ Choose the best available backend for support counting.


jacksonpradolima · 2025-11-04T11:37:47Z

gsppy/accelerate.py

              fall back to CPU for the rest
    - "python": force pure-Python fallback
-    - otherwise: try Rust first and fall back to Python
+    - otherwise:  try Rust first and fall back to Python


gsppy/gsp.py

tests/test_gsp.py

gsppy/accelerate.py

jacksonpradolima · 2025-11-20T03:21:38Z

@project-2-2-2 please, add a new test:

def test_non_contiguous_multiprocessing():
    # Dataset where ('a','c') is a non‑contiguous subsequence but not a contiguous one.
    sequences = [
        ['a', 'b', 'c'],
        ['a', 'c'],
        ['b', 'c', 'a'],
        ['a', 'b', 'c', 'd'],
    ]
    gsp = GSP(sequences)

    # Use a tiny batch size to force multiple batches and trigger multiprocessing.
    # With the current PR, the multiprocessing worker incorrectly uses the contiguous checker.
    result_non_contig = gsp.search(min_support=0.5, contiguous=False, backend='python', batch_size=1)

    # In non‑contiguous mode, ('a','c') should be considered frequent (support = 3/4).
    assert any(('a', 'c') in level for level in result_non_contig), \
        "Expected to find ('a','c') as a non‑contiguous frequent subsequence"

    # Also verify that contiguous search does not report ('a','c').
    result_contig = gsp.search(min_support=0.5, contiguous=True, backend='python', batch_size=1)
    assert not any(('a', 'c') in level for level in result_contig), \
        "('a','c') should not appear in a strict contiguous search"

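The expected supports in this test can be checked by hand. The sketch below is standalone and does not use gsppy; the helper names `is_contiguous` and `is_ordered` are hypothetical stand-ins for the two checkers, not the project's functions.

```python
from typing import Sequence, Tuple


def is_contiguous(sub: Tuple[str, ...], seq: Sequence[str]) -> bool:
    """True if `sub` appears in `seq` as an unbroken run (hypothetical helper)."""
    n = len(sub)
    return any(tuple(seq[i:i + n]) == sub for i in range(len(seq) - n + 1))


def is_ordered(sub: Tuple[str, ...], seq: Sequence[str]) -> bool:
    """True if the items of `sub` appear in `seq` in order, gaps allowed."""
    it = iter(seq)
    # `item in it` consumes the iterator, so each match must come after the previous one.
    return all(item in it for item in sub)


sequences = [
    ['a', 'b', 'c'],
    ['a', 'c'],
    ['b', 'c', 'a'],
    ['a', 'b', 'c', 'd'],
]
pattern = ('a', 'c')

contig_support = sum(is_contiguous(pattern, s) for s in sequences)
ordered_support = sum(is_ordered(pattern, s) for s in sequences)

# With min_support=0.5 over 4 sequences, a pattern needs a count of at least 2.
print(contig_support, ordered_support)
```

Only `['a', 'c']` contains the pattern as an unbroken run, while three of the four sequences contain it in order, which is why the test expects `('a', 'c')` in the non-contiguous result and not in the contiguous one.
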
jacksonpradolima · 2025-11-20T03:22:59Z

gsppy/gsp.py

            batch_results = pool.starmap(
                self._worker_batch,  # Process a batch at a time
-                [(batch, self.transactions, min_support) for batch in batches],
+                [(batch, self.transactions, min_support,subsequence_checker) for batch in batches],


_worker_batch expects a boolean contiguous flag, not a function. Because every function is truthy, non‑contiguous searches will incorrectly use the contiguous checker. The fix is to pass the boolean contiguous instead, or change _worker_batch to accept a subsequence_checker function.
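
A minimal reproduction of the truthiness problem, assuming a worker that takes a boolean flag. `worker_flag`, `contiguous_check`, and `ordered_check` are hypothetical stand-ins, not the actual `_worker_batch` or the checkers in gsppy, and `min_support` here is an absolute count.

```python
from typing import Callable, List, Sequence, Tuple, Union

Pattern = Tuple[str, ...]


def contiguous_check(sub: Pattern, seq: Sequence[str]) -> bool:
    n = len(sub)
    return any(tuple(seq[i:i + n]) == sub for i in range(len(seq) - n + 1))


def ordered_check(sub: Pattern, seq: Sequence[str]) -> bool:
    it = iter(seq)
    return all(item in it for item in sub)


def worker_flag(
    batch: List[Pattern],
    transactions: List[Pattern],
    min_support: int,
    contiguous: Union[bool, Callable],  # meant to be a bool; a Callable is the bug
) -> List[Tuple[Pattern, int]]:
    # Any function object is truthy, so passing a checker here always
    # selects the contiguous branch.
    checker = contiguous_check if contiguous else ordered_check
    results = []
    for candidate in batch:
        support = sum(checker(candidate, t) for t in transactions)
        if support >= min_support:
            results.append((candidate, support))
    return results


transactions = [('a', 'b', 'c'), ('a', 'c'), ('b', 'c', 'a'), ('a', 'b', 'c', 'd')]
batch = [('a', 'c')]

buggy = worker_flag(batch, transactions, 2, ordered_check)  # function passed as flag
fixed = worker_flag(batch, transactions, 2, False)          # boolean passed as intended
print(buggy, fixed)
```

In the buggy call the non-contiguous checker is passed in, yet the contiguous one runs and `('a', 'c')` is dropped; passing the boolean restores the expected support of 3.
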

jacksonpradolima · 2025-11-20T03:25:05Z

gsppy/accelerate.py

                vocab_size=vocab_size,
            )
-            # Map back to original strings
+        # Map back to original strings


The comment does not seem to be at the right indentation.

jacksonpradolima · 2025-11-20T03:25:16Z

gsppy/accelerate.py

            out_rust[tuple(inv_vocab[i] for i in enc_cand)] = int(freq)
        return out_rust
-
+    


extra space

jacksonpradolima · 2025-11-20T03:25:44Z

gsppy/accelerate.py

    - otherwise: try Rust first and fall back to Python
    """
+    if not contiguous:
+        return support_counts_python(


The function immediately returns the Python fallback when contiguous is False, even if the user has chosen the Rust or GPU backend. This design silently discards the acceleration path. If that is intentional, it should be clearly documented; otherwise, the accelerators need updating to handle non‑contiguous subsequences.
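
If the fallback is kept, one option is to make it visible rather than silent. The sketch below is an assumption about how that might look: `resolve_backend` and the backend names `"rust"` and `"gpu"` are hypothetical and are not taken from the gsppy API.

```python
import warnings

# Hypothetical set of backends that only implement contiguous matching.
CONTIGUOUS_ONLY_BACKENDS = {"rust", "gpu"}


def resolve_backend(requested: str, contiguous: bool) -> str:
    """Return the backend that will actually run, warning on a downgrade."""
    if not contiguous and requested in CONTIGUOUS_ONLY_BACKENDS:
        warnings.warn(
            f"backend {requested!r} does not support non-contiguous matching "
            "in this sketch; falling back to 'python'",
            RuntimeWarning,
            stacklevel=2,
        )
        return "python"
    return requested


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = resolve_backend("rust", contiguous=False)

print(chosen, len(caught))
```

A warning keeps existing calls working while telling the user that the requested accelerator was not used; raising an error instead would be the stricter alternative.
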

jacksonpradolima · 2025-11-20T03:26:28Z

gsppy/utils.py

+    if not subsequence:
+        return False
+    it = iter(sequence)
+    return all(item in it for item in subsequence)


It returns False for an empty subsequence. By definition, the empty subsequence exists in every sequence. This edge case doesn’t affect the current algorithm (it never checks empty candidates), but a trivial fix would return True when subsequence is empty. A bounded lru_cache (maxsize) could also prevent unbounded memory use.
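
A sketch of both suggestions, assuming the arguments are tuples so they are hashable for caching. The function name and the `maxsize` value are illustrative choices, not the ones used in `gsppy/utils.py`.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=4096)  # bounded cache; 4096 is an arbitrary illustrative limit
def is_subsequence_in_order(
    subsequence: Tuple[str, ...], sequence: Tuple[str, ...]
) -> bool:
    """True if `subsequence` occurs in `sequence` in order, gaps allowed."""
    if not subsequence:
        # The empty subsequence is contained in every sequence.
        return True
    it = iter(sequence)
    return all(item in it for item in subsequence)


print(is_subsequence_in_order((), ('a', 'b')))
print(is_subsequence_in_order(('a', 'c'), ('a', 'b', 'c')))
print(is_subsequence_in_order(('c', 'a'), ('a', 'b', 'c')))
```

Because `all()` over an empty iterable is already True, simply dropping the early `return False` would give the same result; the explicit guard only makes the intent clearer.
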

sonarqubecloud · 2025-11-24T09:26:58Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Addition of non-contiguous search and parameterization

a9f8fa4

project-2-2-2 requested a review from jacksonpradolima as a code owner October 15, 2025 10:15

project-2-2-2 added 2 commits October 15, 2025 20:54

corrected merge conflict

949682f

Corrected documentation

c3bddd4

jacksonpradolima requested changes Nov 4, 2025

project-2-2-2 added 2 commits November 6, 2025 11:31

Fixed code quality

cd24bc7

refactored comments

df92af1

jacksonpradolima reviewed Nov 6, 2025

removed unused library

f0b0bde

project-2-2-2 requested a review from jacksonpradolima November 9, 2025 03:19

jacksonpradolima requested changes Nov 20, 2025

fixed issues

04becb1

project-2-2-2 requested a review from jacksonpradolima November 25, 2025 11:47
