<feat WIP>: augmenting mmlu #29
base: main
Conversation
if field_response not in df.columns:
    df[field_response] = ""
if field_response not in df.columns:
if field_ans not in df.columns:
Great catch!
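For reference, a minimal sketch of what the initialization presumably looks like after the fix (column names taken from the diff above; this is not the PR's exact code):

import pandas as pd

def ensure_output_columns(df: pd.DataFrame, field_response: str, field_ans: str) -> pd.DataFrame:
    # Create both output columns up front so later row-wise writes never hit a missing column.
    for col in (field_response, field_ans):
        if col not in df.columns:
            df[col] = ""
    return df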
Return JSON only with:
{{"answer":"{letters}","rationale":"1-3 sentences (concise)","key_steps":["fact1","fact2","fact3"]}}
Answer the MCQ briefly and factually (no step-by-step reasoning).
Why? I thought we wanted to elicit step-by-step reasoning
If we use step-by-step, it makes sense to enable thinking, which will be very expensive on MMLU-Pro (try changing the prompt and setting the -1 flag to experiment).
I mean:
thinking_config=types.ThinkingConfig(thinking_budget=0)
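For context, a minimal sketch of where that setting plugs in with the google-genai SDK (the model name here is only an example): thinking_budget=0 turns thinking off, while thinking_budget=-1 lets the model pick its own budget, which is presumably the -1 mentioned above.

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",  # example model, not necessarily the one used in the PR
    contents="Answer the MCQ briefly and factually.",
    config=types.GenerateContentConfig(
        # 0 disables thinking entirely; -1 enables dynamic thinking (model chooses the budget)
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
print(response.text)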
I guess we need to better align on the goal of the experiment first, then. Could you add a design doc to docs with: hypothesis, execution plan, expected results?
},
}

def process_tsv(tsv_path, out_jsonl, limit=None):
Unify it with the existing distill_on_dataset?
Yes, we can, but I think the CLI call is more useful.
Shall we use it as a function call from a script in experiments instead of a plain CLI call then?
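For example, something along these lines (module and file paths are just placeholders, not taken from the repo):

# experiments/augment_mmlu.py  (hypothetical script)
from core.synth_mmlu import process_tsv  # module name assumed

if __name__ == "__main__":
    process_tsv(
        tsv_path="data/mmlu_pro.tsv",        # placeholder input
        out_jsonl="outputs/mmlu_pro_aug.jsonl",
        limit=100,
    )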
I accidentally renamed the file and made changes in one commit (sorry about that). In general: I changed gemini to openrouter, changed the logic to step-by-step reasoning, and left key_steps as the summary. In the future, I want to merge branch a into branch c, since the first part of c duplicates a.
@@ -0,0 +1,17 @@
1) **Main point**

Obtain a synthetic dataset (answers + brief explanations + analysis of erroneous answers + CoT tokens) for training subsequent models.
We actually want to elicit the full reasoning chain, don't we?
Could you also add why we want to do it? AFAIU, we want to fine-tune small models on different versions of the distilled CoT and compare the performance. Right?
from core.prompts.mmlu_branches_aug import *

# defaults
DEFAULT_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1:free")
Shall we pass it as a config? As discussed before, we want reproducible results, and it is extremely easy to forget which options we used if we pass them as env vars or CLI args.
It's just a default argument in the file, and the function itself accepts and works with explicit arguments. Therefore, using a config is the responsibility of whoever calls the code.
def synth_on_dataset(
in_filename: str,
out_jsonl: str,
model: str = DEFAULT_MODEL,
max_tokens: int = DEFAULT_MAX_TOKENS,
dump_every: int = DUMP_EVERY,
limit: int | None = None,
branches: tuple[str, ...] = DEFAULT_BRANCHES
):
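For reference, a minimal sketch of what a config-driven call could look like (all names and defaults below are assumptions, not the repo's actual API):

import json
from dataclasses import dataclass

@dataclass
class SynthConfig:
    in_filename: str
    out_jsonl: str
    model: str = "deepseek/deepseek-r1:free"
    max_tokens: int = 4096   # assumed default
    dump_every: int = 10
    limit: int | None = None

def load_config(path: str) -> SynthConfig:
    # Keeping the options in a versioned JSON file makes each run reproducible.
    with open(path) as f:
        return SynthConfig(**json.load(f))

# cfg = load_config("experiments/mmlu_aug_config.json")
# synth_on_dataset(cfg.in_filename, cfg.out_jsonl, model=cfg.model,
#                  max_tokens=cfg.max_tokens, dump_every=cfg.dump_every, limit=cfg.limit)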
CHUNK_SIZE = int(os.getenv("SYNTH_CHUNK_SIZE", "16"))
DUMP_EVERY = int(os.getenv("SYNTH_DUMP_EVERY", "10"))

ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z")+1)]
Re-use what we have in https://github.com/LabARSS/reasoning-fine-tune/blob/85cc151cdfcac6a5ec409a9f2583486318fe7ed0/src/reasoning_fine_tune/prompts/mmlu_single_token_answer.py#L34?
Extract it into a separate file for better readability?
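E.g. one shared module along these lines (the path is only a suggestion, not an existing file):

# src/reasoning_fine_tune/prompts/constants.py  (suggested location)
ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

# both scripts would then do:
# from reasoning_fine_tune.prompts.constants import ALL_LETTERS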
Let's discuss this in a conference call.
pass
j = j or {}

if reasoning_text and "thinking" not in j:
Could you help me understand what we are doing here?
Validation of the JSON received from the LLM.
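Roughly this kind of guard, I assume (a sketch of the pattern, not the PR's exact code):

import json
import re

def parse_llm_json(raw_text: str, reasoning_text: str | None = None) -> dict:
    # Try to parse the model output as JSON, recovering it from markdown fences if needed.
    j: dict = {}
    try:
        j = json.loads(raw_text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw_text, re.DOTALL)
        if match:
            try:
                j = json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    j = j or {}
    # If the provider exposes the reasoning chain separately, keep it alongside the answer.
    if reasoning_text and "thinking" not in j:
        j["thinking"] = reasoning_text
    return j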
record_in = _build_record_in(row_dict, question, choices, letters, gold, model)
jobs.append((row.Index, "A", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "B", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "C", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
Should we first get A and B, and then run C on top of A, as you proposed in the chat before?
Yes, that's right. In the new version it's refactored.
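For readers of the thread, the scheduling discussed here is roughly this shape (a simplified sketch, not the actual refactored implementation):

def run_branches(rows, run_branch):
    # Phase 1: branches A and B are independent and can run per row immediately.
    a_results = {}
    for idx, payload in rows:
        a_results[idx] = run_branch(idx, "A", payload)
        run_branch(idx, "B", payload)
    # Phase 2: branch C re-uses the branch-A output as additional context.
    for idx, payload in rows:
        run_branch(idx, "C", {**payload, "branch_a": a_results[idx]})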
Return JSON ONLY with the following schema:
{{
  "answer": "{letters}",
  "rationale": "concise 1-2 sentence justification (no fluff)",
Shall we ask for the final answer straight away if we are using a reasoning model and we can extract its reasoning chain?
Return JSON only:
{{"correct_answer":"{letters}",
"why_correct": "step-by-step reasoning showing why the gold option is correct",
"distractor_analysis": {distractor_tpl} }}
Could you help me understand distractor_analysis vs why_correct?
distractor_analysis explains all answer options; why_correct explains only the correct answer.
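So an illustrative example of the expected output would be (values invented for illustration):

example = {
    "correct_answer": "B",
    "why_correct": "Step-by-step reasoning for why option B follows from the question.",
    "distractor_analysis": {
        "A": "Plausible, but contradicts the premise given in the stem.",
        "B": "Correct: matches the definition asked about.",
        "C": "Confuses the concept with a related one.",
        "D": "Results from a common calculation mistake.",
    },
}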