-
-
Notifications
You must be signed in to change notification settings - Fork 23
Addition of non-contiguous search and parameterization #116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
edit version |
jacksonpradolima
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initial (quick) review
|
|
||
| cp_flat = cp.asarray(flat, dtype=cp.int32) # type: ignore[name-defined] | ||
| counts = cp.bincount(cp_flat, minlength=vocab_size) # type: ignore[attr-defined] | ||
| counts_host: Any = counts.get() # back to host as a NumPy array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you remove my comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry sir i will add it back
gsppy/accelerate.py
Outdated
| def _encode_transactions(transactions: List[Tuple[str, ...]]) -> Tuple[List[List[int]], Dict[int, str], Dict[str, int]]: | ||
| """Encode transactions of strings into integer IDs. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the extra spaces, also check in the other places
gsppy/accelerate.py
Outdated
| ) -> Dict[Tuple[str, ...], int]: | ||
| """Pure-Python fallback for support counting (single-process). | ||
| """Pure-Python fallback for support counting (single-process). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra spaces again
gsppy/accelerate.py
Outdated
| ) -> Dict[Tuple[str, ...], int]: | ||
| """Choose the best available backend for support counting. | ||
| """ Choose the best available backend for support counting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
gsppy/accelerate.py
Outdated
| fall back to CPU for the rest | ||
| - "python": force pure-Python fallback | ||
| - otherwise: try Rust first and fall back to Python | ||
| - otherwise: try Rust first and fall back to Python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
|
@project-2-2-2 please, add a new test: def test_non_contiguous_multiprocessing():
# Dataset where ('a','c') is a non‑contiguous subsequence but not a contiguous one.
sequences = [
['a', 'b', 'c'],
['a', 'c'],
['b', 'c', 'a'],
['a', 'b', 'c', 'd'],
]
gsp = GSP(sequences)
# Use a tiny batch size to force multiple batches and trigger multiprocessing.
# With the current PR, the multiprocessing worker incorrectly uses the contiguous checker.
result_non_contig = gsp.search(min_support=0.5, contiguous=False, backend='python', batch_size=1)
# In non‑contiguous mode, ('a','c') should be considered frequent (support = 3/4).
assert any(('a', 'c') in level for level in result_non_contig), \
"Expected to find ('a','c') as a non‑contiguous frequent subsequence"
# Also verify that contiguous search does not report ('a','c').
result_contig = gsp.search(min_support=0.5, contiguous=True, backend='python', batch_size=1)
assert not any(('a', 'c') in level for level in result_contig), \
"('a','c') should not appear in a strict contiguous search" |
gsppy/gsp.py
Outdated
| batch_results = pool.starmap( | ||
| self._worker_batch, # Process a batch at a time | ||
| [(batch, self.transactions, min_support) for batch in batches], | ||
| [(batch, self.transactions, min_support,subsequence_checker) for batch in batches], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_worker_batch expects a boolean contiguous flag, not a function. Because every function is truthy, non‑contiguous searches will incorrectly use the contiguous checker. The fix is to pass the boolean contiguous instead, or change _worker_batch to accept a subsequence_checker function.
| vocab_size=vocab_size, | ||
| ) | ||
| # Map back to original strings | ||
| # Map back to original strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment does not seems in the right identation
gsppy/accelerate.py
Outdated
| out_rust[tuple(inv_vocab[i] for i in enc_cand)] = int(freq) | ||
| return out_rust | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space
| - otherwise: try Rust first and fall back to Python | ||
| """ | ||
| if not contiguous: | ||
| return support_counts_python( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function immediately returns the Python fallback when contiguous is False, even if the user has chosen the Rust or GPU backend. This design silently discards the acceleration path. If that is intentional, it should be clearly documented; otherwise, the accelerators need updating to handle non‑contiguous subsequences.
| if not subsequence: | ||
| return False | ||
| it = iter(sequence) | ||
| return all(item in it for item in subsequence) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns False for an empty subsequence. By definition, the empty subsequence exists in every sequence. This edge case doesn’t affect the current algorithm (it never checks empty candidates), but a trivial fix would return True when subsequence is empty. A bounded lru_cache (maxsize) could also prevent unbounded memory use.
|



This pull request implements the enhancement for non-contiguous pattern matching as we discussed in Issue #115.
Key Changes:
contiguousparameter has been added to thesearchmethod. The default isFalseto align with the canonical GSP algorithm.Closes #115.
Thank you for the opportunity and your guidance. I'm ready to make any further changes needed.