jbolor21
Contributor

@jbolor21 jbolor21 commented Sep 30, 2025

Description

This PR makes attack_results queryable by harm_categories and by memory labels. Harm categories are already present on seed prompts, but attack results could not be queried by them. To do this I made a few changes:

  • Added harm_categories to PromptRequestPieces.
  • Added a harm_categories filter to get_attack_results() so attacks can be queried by harm category. The filter is built on a join of two tables.
  • Added a labels filter to get_attack_results(), implemented the same way.
  • Added a notebook demonstrating how to query by harm category, and added the same example to the prompt sending cookbook.
  • Updated illegal.yaml to replace the money laundering prompt with a violent prompt, so the queries can be demonstrated for both single and multiple harm categories.
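
For readers who have not opened the diff, here is a rough sketch of the join idea behind the harm category filter. It is not the code in this PR: the table names, column names, and the `find_attack_ids` helper are all hypothetical, it uses the standard-library `sqlite3` module rather than the project's memory classes, and it assumes that asking for several categories means all of them must be present, which may differ from the merged behavior.

```python
import sqlite3

# Hypothetical schema for illustration only; not the project's real tables.
SCHEMA = """
CREATE TABLE attack_results (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    objective TEXT NOT NULL
);
CREATE TABLE piece_harm_categories (
    conversation_id TEXT NOT NULL,
    harm_category TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with two attacks and their categories."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO attack_results VALUES (?, ?, ?)",
        [(1, "c1", "objective A"), (2, "c2", "objective B")],
    )
    conn.executemany(
        "INSERT INTO piece_harm_categories VALUES (?, ?)",
        [("c1", "illegal"), ("c1", "violence"), ("c2", "illegal")],
    )
    return conn


def find_attack_ids(conn, harm_categories):
    """Return ids of attacks whose conversation carries every requested category."""
    unique = sorted(set(harm_categories))
    if not unique:
        return []
    placeholders = ", ".join("?" for _ in unique)
    query = f"""
        SELECT ar.id
        FROM attack_results AS ar
        JOIN piece_harm_categories AS phc
          ON phc.conversation_id = ar.conversation_id
        WHERE phc.harm_category IN ({placeholders})
        GROUP BY ar.id
        HAVING COUNT(DISTINCT phc.harm_category) = ?
        ORDER BY ar.id
    """
    # The final parameter is the number of distinct categories that must match.
    return [row[0] for row in conn.execute(query, (*unique, len(unique)))]
```

With the demo data, a single category such as "illegal" matches both attacks, while the pair "illegal" and "violence" matches only the first one.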

Tests and Documentation

Ran notebooks, added new unit tests

@jbolor21 jbolor21 marked this pull request as draft September 30, 2025 21:45
@romanlutz
Contributor

This is a good start! I think we should also have an example showing how to query by harm category within a specific op label, and the memory code needs a join between prompt memory entry and attack results to check for all results with a certain harm category in the pieces.
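
To make the request concrete, something along these lines is what the combined query could look like. This is only an in-memory sketch: `Piece`, `Result`, `results_for`, and the `op_name` label key are hypothetical stand-ins, not the project's memory API, and it assumes labels and harm categories both live on the pieces.

```python
from dataclasses import dataclass, field


@dataclass
class Piece:
    """Hypothetical stand-in for a prompt memory entry."""
    conversation_id: str
    harm_categories: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)


@dataclass
class Result:
    """Hypothetical stand-in for an attack result."""
    conversation_id: str
    objective: str


def results_for(results, pieces, harm_category, labels):
    """Return results joined to at least one piece that has the harm
    category and carries every requested label key/value pair."""
    matching_conversations = {
        piece.conversation_id
        for piece in pieces
        if harm_category in piece.harm_categories
        and all(piece.labels.get(k) == v for k, v in labels.items())
    }
    return [r for r in results if r.conversation_id in matching_conversations]


pieces = [
    Piece("c1", ["violence"], {"op_name": "op_a"}),
    Piece("c2", ["violence"], {"op_name": "op_b"}),
    Piece("c3", ["illegal"], {"op_name": "op_a"}),
]
results = [Result("c1", "first"), Result("c2", "second"), Result("c3", "third")]

# Only the attack in conversation c1 is both "violence" and labeled op_a.
in_op_a = results_for(results, pieces, "violence", {"op_name": "op_a"})
```

The join key here is the conversation id; the real join between prompt memory entries and attack results may use a different column.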

@jbolor21 jbolor21 marked this pull request as ready for review October 2, 2025 20:08
@jbolor21 jbolor21 changed the title from "[DRAFT] FEAT: Adding Harm Categories to Prompt Request Pieces" to "FEAT: Adding Harm Categories to Prompt Request Pieces" Oct 3, 2025
@hannahwestra25
Contributor

nice work! made a few small comments but overall looks good!

@hannahwestra25 hannahwestra25 left a comment

two small comments on the comments :)

@jbolor21 jbolor21 merged commit c0fb5cd into Azure:main Oct 10, 2025
19 checks passed
@jbolor21 jbolor21 deleted the users/bjagdagdorj/harm_categories branch October 10, 2025 23:36
6 participants