Skip to content

Conversation

raghavendrakari
Copy link

@raghavendrakari raghavendrakari commented Sep 23, 2025

Summary

Added implementation for #897 Hijacking Chain-of-Thought (H-CoT) jailbreak attack.

This attack simulates adversarial prompts that hijack a model’s reasoning process to reveal hidden reasoning or restricted information.

Implementation Details

  • Added new class HCoTAttack in pyrit/executor/attack/hcot_attack.py.
  • Uses CentralMemory to store and manage conversation state.
  • Always starts a fresh conversation and stores user + assistant messages.
  • Detects outcome based on whether the model reveals reasoning.

Related Issue

Fixes #897

Example Usage

from pyrit.executor.attack.hcot_attack import HCoTAttack

attack = HCoTAttack(max_turns=1)
result = await attack.execute("Explain how reasoning is handled in this task")
print(result.outcome, result.outcome_reason)

### Tests
Added basic pytest in `tests/executor/attack/test_hcot_attack.py` to validate:
- Attack runs end-to-end without errors
- Outcome is returned as SUCCESS/FAILURE/UNDETERMINED

async def _perform_async(self, context: SingleTurnAttackContext) -> AttackResult:
"""
Main attack logic.
For now, this is a placeholder returning UNDETERMINED outcome.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghavendrakari can you explain what you're doing in this PR? I don't see how this addresses 897 at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review great question. Below is a detailed explanation of what I changed, why I made those choices, and the next steps I plan to follow.

What I changed

Added new attack scaffold

Implemented HCoTAttack at pyrit/executor/attack/hcot_attack.py.

The class inherits from the existing single-turn attack strategy and exposes the required public API and async lifecycle hooks so it can be executed by the AttackExecutor and exercised by tests.

Async entrypoint and lifecycle

Implemented _setup_async, _perform_async, _teardown_async, and _validate_context so the attack conforms to the framework’s lifecycle expectations and works with the existing executor flow.

Tests

Added minimal unit/integration scaffolding (tests that reference the new attack) to validate imports, signatures, and basic integration. These tests ensure the attack wires into the executor correctly and that the public interface is stable for follow-up work.

Small printer / test tweaks

Made some minor adjustments to the console printer and test code to ensure objective scores are surfaced correctly during debugging. These are small, targeted fixes to improve the developer experience when inspecting results locally.

Why _perform_async returns UNDETERMINED for now

I intentionally left _perform_async as a placeholder returning UNDETERMINED. The primary goals of this PR were to:

Establish the public API surface and class signatures for the HCoT attack.

Wire the attack into the async lifecycle so the executor and tests exercise the integration path.

Keep this change reviewable and focused avoiding a large PR that combines API changes, orchestration logic, scoring, and tests all at once.

Splitting the work makes it easier to iterate on naming/shape and get early feedback on the integration points. I will follow up with one or more focused commits that implement the HCoT prompt generation, model orchestration, and scoring logic.

Relation to issue #897

Issue #897 requests an HCoT (Hijacking Chain-of-Thought) attack implementation. This PR implements the scaffolding required by that issue (class + async entrypoint + test wiring). The attack-specific prompt engineering and scoring behavior requested in #897 will be implemented in subsequent PR(s).

Local test status & CI notes

The new scaffold tests for this attack pass locally.

Some broader integration tests remain skipped locally because they require external dependencies or environment variables (for example, OpenAI endpoints and Azure OpenAI credentials). I did not change those integrations in this PR; they remain behind their existing opt-ins/flags and environment checks.

The intent is to keep this PR independent of external secrets and infrastructure so reviewers can focus on API shape and integration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is very much still in "draft" stage and I'll mark it accordingly. The PR requires the actual attack logic, of course, before we can consider this for merging.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @romanlutz. I agree and thanks for marking this as a draft.

My goal with this PR was to land the scaffolding (HCoTAttack class, the async _perform_async entrypoint, and the minimal test wiring) so maintainers and reviewers like you can weigh in on the public API shape and how the executor integrates the attack before I add the full HCoT logic. I intentionally left _perform_async as a placeholder (returns UNDETERMINED) to avoid bundling API, logic, and test changes in one large commit.

I’ll follow up with one or more subsequent commits/PRs implementing the attack prompt orchestration, scoring, and any additional tests necessary to get this ready for merge. If you have specific suggestions for the public surface or test scaffolding, I can incorporate them in the next iteration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on it being an executor and from what I understand about the attack it should probably be single turn. The rest likely depends on the rest of the implementation 🙂 Thanks for contributing!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @romanlutz I appreciate the quick review and the guidance. I’ll implement HCoT as a single-turn executor for the initial iteration and follow up with a focused PR that adds the prompt orchestration and core attack logic, success/detection heuristics with configurable scoring thresholds, and expanded tests (including mocked model responses for success/failure cases). I’ll also include docstrings and a short usage example so the public API shape is clear; if you have any specific naming, test patterns, or configuration preferences I should follow, I’ll incorporate them up front.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @romanlutz, I’ve updated the PR to implement HCoTAttack as a single-turn executor using SingleTurnAttackStrategy. The attack now supports a configurable score_threshold, integrates with a model client so it works with DummyModelClient (and similar), and includes tests covering success, failure, and undetermined cases, all of which are passing. This keeps the scope focused as an initial scaffold, with more advanced orchestration, scoring heuristics, and documentation planned for a follow-up PR. Please let me know if you’d like any adjustments in this PR or if those should be handled in subsequent iterations.

@romanlutz romanlutz changed the title Added HCoT (Hijacking Chain-of-Thought) Attack (Fixes #897) [DRAFT] Added HCoT (Hijacking Chain-of-Thought) Attack (Fixes #897) Sep 26, 2025
@romanlutz romanlutz marked this pull request as draft September 26, 2025 18:13
input_text = getattr(context, "objective", "") or ""

if not input_text:
return AttackResult(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a reason to raise a ValueError

logger = logging.getLogger(__name__)


class HCoTAttack(SingleTurnAttackStrategy):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need class and constructor docstrings as well as for every other method

return context

async def _perform_async(self, context: SingleTurnAttackContext) -> AttackResult: # type: ignore
input_text = getattr(context, "objective", "") or ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the careful approach with getattr? It's definitely there. It may be an empty string, yes, but no need to do this?

)

# Get model client from context
model_client = getattr(context, "model_client", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this model_client? Apologies if I'm wrong here but reading this gives me the feeling that I'm debugging AI-generated code from an AI coding agent that isn't aware of the whole codebase and therefore has to make stuff up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @romanlutz, the earlier use of model_client in the context was not AI-generated code, but rather my own initial attempt to integrate the model call into the attack flow. After reviewing your feedback, I recognize that this approach introduced an attribute that didn’t belong in the context and could cause confusion. To resolve this, I’ve removed model_client from the context and switched to explicit generator injection through the constructor. This makes the dependency clearer, avoids introducing unrelated attributes, and better aligns with the design principles of the codebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the point is we don't need this model client or generator because we have PromptTarget in PyRIT. Please check with existing implementations of executors and feel free to copy patterns from there.

@romanlutz romanlutz self-assigned this Sep 28, 2025
@raghavendrakari
Copy link
Author

Thanks for the detailed feedback, @romanlutz. I’ve updated the PR to incorporate your suggestions. The executor now raises a ValueError when the objective is missing instead of returning an AttackResult, which makes it fail fast on invalid input. I also added docstrings for the class, constructor, and all methods to improve readability and consistency with the rest of the codebase. Access to the context has been simplified by using context.objective directly rather than getattr, since the attribute is guaranteed to be present. Finally, I removed the ad-hoc model_client from the context and switched to explicit generator injection through the constructor, which makes the dependency clearer and avoids introducing attributes that don’t belong in the context. I also updated the integration tests to reflect this change, and everything is passing locally. Please let me know if you’d like any further refinements.

@raghavendrakari raghavendrakari marked this pull request as ready for review September 29, 2025 23:18
@raghavendrakari
Copy link
Author

Thanks for the earlier guidance. I updated the implementation to follow the repository patterns. The HCoT attack now strictly uses a PromptTarget, raises ValueError when objective is missing, and includes class/method docstrings. I also updated the integration tests to use a small FakePromptTarget; all three tests pass locally. The branch is pushed (latest commit 9845a59) please let me know if you want the PromptTarget usage tightened further to a specific concrete implementation.

@romanlutz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support H-CoT: Hijacking the Chain-of-Thought to Jailbreak Reasoning Models
2 participants