Skip to content

Conversation

@project-2-2-2
Copy link

@project-2-2-2 project-2-2-2 commented Oct 15, 2025

This pull request implements the enhancement for non-contiguous pattern matching as we discussed in Issue #115.

Key Changes:

  • A contiguous parameter has been added to the search method. The default is False to align with the canonical GSP algorithm.
  • The logic now correctly switches between contiguous and non-contiguous subsequence checking.
  • The CLI and acceleration layers have been updated to support the new parameter.
  • All project tests are passing.

Closes #115.

Thank you for the opportunity and your guidance. I'm ready to make any further changes needed.

@project-2-2-2
Copy link
Author

edit version

Copy link
Owner

@jacksonpradolima jacksonpradolima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial (quick) review


cp_flat = cp.asarray(flat, dtype=cp.int32) # type: ignore[name-defined]
counts = cp.bincount(cp_flat, minlength=vocab_size) # type: ignore[attr-defined]
counts_host: Any = counts.get() # back to host as a NumPy array
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove my comments?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry sir i will add it back

def _encode_transactions(transactions: List[Tuple[str, ...]]) -> Tuple[List[List[int]], Dict[int, str], Dict[str, int]]:
"""Encode transactions of strings into integer IDs.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the extra spaces, also check in the other places

) -> Dict[Tuple[str, ...], int]:
"""Pure-Python fallback for support counting (single-process).
"""Pure-Python fallback for support counting (single-process).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra spaces again

) -> Dict[Tuple[str, ...], int]:
"""Choose the best available backend for support counting.
""" Choose the best available backend for support counting.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

fall back to CPU for the rest
- "python": force pure-Python fallback
- otherwise: try Rust first and fall back to Python
- otherwise: try Rust first and fall back to Python
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

@jacksonpradolima
Copy link
Owner

@project-2-2-2 please, add a new test:

def test_non_contiguous_multiprocessing():
    # Dataset where ('a','c') is a non‑contiguous subsequence but not a contiguous one.
    sequences = [
        ['a', 'b', 'c'],
        ['a', 'c'],
        ['b', 'c', 'a'],
        ['a', 'b', 'c', 'd'],
    ]
    gsp = GSP(sequences)

    # Use a tiny batch size to force multiple batches and trigger multiprocessing.
    # With the current PR, the multiprocessing worker incorrectly uses the contiguous checker.
    result_non_contig = gsp.search(min_support=0.5, contiguous=False, backend='python', batch_size=1)

    # In non‑contiguous mode, ('a','c') should be considered frequent (support = 3/4).
    assert any(('a', 'c') in level for level in result_non_contig), \
        "Expected to find ('a','c') as a non‑contiguous frequent subsequence"

    # Also verify that contiguous search does not report ('a','c').
    result_contig = gsp.search(min_support=0.5, contiguous=True, backend='python', batch_size=1)
    assert not any(('a', 'c') in level for level in result_contig), \
        "('a','c') should not appear in a strict contiguous search"

gsppy/gsp.py Outdated
batch_results = pool.starmap(
self._worker_batch, # Process a batch at a time
[(batch, self.transactions, min_support) for batch in batches],
[(batch, self.transactions, min_support,subsequence_checker) for batch in batches],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_worker_batch expects a boolean contiguous flag, not a function. Because every function is truthy, non‑contiguous searches will incorrectly use the contiguous checker. The fix is to pass the boolean contiguous instead, or change _worker_batch to accept a subsequence_checker function.

vocab_size=vocab_size,
)
# Map back to original strings
# Map back to original strings
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment does not seems in the right identation

out_rust[tuple(inv_vocab[i] for i in enc_cand)] = int(freq)
return out_rust

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space

- otherwise: try Rust first and fall back to Python
"""
if not contiguous:
return support_counts_python(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function immediately returns the Python fallback when contiguous is False, even if the user has chosen the Rust or GPU backend. This design silently discards the acceleration path. If that is intentional, it should be clearly documented; otherwise, the accelerators need updating to handle non‑contiguous subsequences.

if not subsequence:
return False
it = iter(sequence)
return all(item in it for item in subsequence)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns False for an empty subsequence. By definition, the empty subsequence exists in every sequence. This edge case doesn’t affect the current algorithm (it never checks empty candidates), but a trivial fix would return True when subsequence is empty. A bounded lru_cache (maxsize) could also prevent unbounded memory use.

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GSP fails to detect non-contiguous subsequences (e.g., ('a','c')) due to contiguous matcher

2 participants