<feat WIP>: augmenting mmlu #29
base: main
Conversation
if field_response not in df.columns:
    df[field_response] = ""
if field_response not in df.columns:
if field_ans not in df.columns:
Great catch!
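For reference, a minimal sketch of what the initialization presumably looks like after the fix (column names taken from the diff above; this is not the PR's exact code):

import pandas as pd

def ensure_output_columns(df: pd.DataFrame, field_response: str, field_ans: str) -> pd.DataFrame:
    # Create both output columns up front so later row-wise writes never hit a missing column.
    for col in (field_response, field_ans):
        if col not in df.columns:
            df[col] = ""
    return df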
Return JSON only with:
{{"answer":"{letters}","rationale":"1-3 sentences (concise)","key_steps":["fact1","fact2","fact3"]}}
Answer the MCQ briefly and factually (no step-by-step reasoning).
Why? I thought we wanted to elicit step-by-step reasoning
If we use step-by-step, it makes sense to enable thinking, which will be very expensive on MMLU-Pro (try changing the prompt and setting the -1 flag to experiment).
I mean:
thinking_config=types.ThinkingConfig(thinking_budget=0)
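For context, a minimal sketch of where that setting plugs in with the google-genai SDK (the model name here is only an example): thinking_budget=0 turns thinking off, while thinking_budget=-1 lets the model pick its own budget, which is presumably the -1 mentioned above.

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",  # example model, not necessarily the one used in the PR
    contents="Answer the MCQ briefly and factually.",
    config=types.GenerateContentConfig(
        # 0 disables thinking entirely; -1 enables dynamic thinking (model chooses the budget)
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
print(response.text)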
I guess we need to better align on the goal of the experiment first, then. Could you add a design doc to docs with: hypothesis, execution plan, expected results?
},
}

def process_tsv(tsv_path, out_jsonl, limit=None):
Unify it with the existing distill_on_dataset?
Yes, we can, but I think the CLI call is more useful.
Shall we use it as a function call from a script in experiments instead of a plain CLI call then?
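For example, something along these lines (module and file paths are just placeholders, not taken from the repo):

# experiments/augment_mmlu.py  (hypothetical script)
from core.synth_mmlu import process_tsv  # module name assumed

if __name__ == "__main__":
    process_tsv(
        tsv_path="data/mmlu_pro.tsv",        # placeholder input
        out_jsonl="outputs/mmlu_pro_aug.jsonl",
        limit=100,
    )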
I accidentally renamed the file and made changes in one commit (sorry about that). In general: I changed gemini to openrouter, changed the logic to step-by-step reasoning, and left key_steps as the summary. In the future, I want to merge branch a into branch c, since the first part of c duplicates a.
@@ -0,0 +1,17 @@
1) **Main point**

Obtain a synthetic dataset (answers + brief explanations + analysis of erroneous answers + CoT tokens) for training subsequent models.
We actually want to elicit the full reasoning chain, don't we?
Could you also add why we want to do it? AFAIU, we want to fine-tune small models on different versions of the distilled CoT and compare the performance. Right?
from core.prompts.mmlu_branches_aug import *

# defaults
DEFAULT_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1:free")
Shall we pass it as a config? As discussed before, we want reproducible results, and it is extremely easy to forget which options we used if we pass them as env vars or CLI args.
It's just a default argument in the file, and the function itself accepts and works with explicit arguments. Therefore, using a config is the responsibility of whoever calls the code.
def synth_on_dataset(
in_filename: str,
out_jsonl: str,
model: str = DEFAULT_MODEL,
max_tokens: int = DEFAULT_MAX_TOKENS,
dump_every: int = DUMP_EVERY,
limit: int | None = None,
branches: tuple[str, ...] = DEFAULT_BRANCHES
):
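For reference, a minimal sketch of what a config-driven call could look like (all names and defaults below are assumptions, not the repo's actual API):

import json
from dataclasses import dataclass

@dataclass
class SynthConfig:
    in_filename: str
    out_jsonl: str
    model: str = "deepseek/deepseek-r1:free"
    max_tokens: int = 4096   # assumed default
    dump_every: int = 10
    limit: int | None = None

def load_config(path: str) -> SynthConfig:
    # Keeping the options in a versioned JSON file makes each run reproducible.
    with open(path) as f:
        return SynthConfig(**json.load(f))

# cfg = load_config("experiments/mmlu_aug_config.json")
# synth_on_dataset(cfg.in_filename, cfg.out_jsonl, model=cfg.model,
#                  max_tokens=cfg.max_tokens, dump_every=cfg.dump_every, limit=cfg.limit)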
CHUNK_SIZE = int(os.getenv("SYNTH_CHUNK_SIZE", "16"))
DUMP_EVERY = int(os.getenv("SYNTH_DUMP_EVERY", "10"))

ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z")+1)]
Re-use what we have in https://github.com/LabARSS/reasoning-fine-tune/blob/85cc151cdfcac6a5ec409a9f2583486318fe7ed0/src/reasoning_fine_tune/prompts/mmlu_single_token_answer.py#L34?
Extract it into a separate file for better readability?
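E.g. one shared module along these lines (the path is only a suggestion, not an existing file):

# src/reasoning_fine_tune/prompts/constants.py  (suggested location)
ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

# both scripts would then do:
# from reasoning_fine_tune.prompts.constants import ALL_LETTERS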
Let's discuss this in a conference call.
pass
j = j or {}

if reasoning_text and "thinking" not in j:
Could you help me understand what we are doing here?
Validation of the JSON received from the LLM.
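Roughly this kind of guard, I assume (a sketch of the pattern, not the PR's exact code):

import json
import re

def parse_llm_json(raw_text: str, reasoning_text: str | None = None) -> dict:
    # Try to parse the model output as JSON, recovering it from markdown fences if needed.
    j: dict = {}
    try:
        j = json.loads(raw_text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw_text, re.DOTALL)
        if match:
            try:
                j = json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    j = j or {}
    # If the provider exposes the reasoning chain separately, keep it alongside the answer.
    if reasoning_text and "thinking" not in j:
        j["thinking"] = reasoning_text
    return j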
record_in = _build_record_in(row_dict, question, choices, letters, gold, model)
jobs.append((row.Index, "A", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "B", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "C", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
Should we first get A and B, and then run C on top of A, as you proposed in the chat before?
Yes, that's right. In the new version it's refactored.
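For readers of the thread, the scheduling discussed here is roughly this shape (a simplified sketch, not the actual refactored implementation):

def run_branches(rows, run_branch):
    # Phase 1: branches A and B are independent and can run per row immediately.
    a_results = {}
    for idx, payload in rows:
        a_results[idx] = run_branch(idx, "A", payload)
        run_branch(idx, "B", payload)
    # Phase 2: branch C re-uses the branch-A output as additional context.
    for idx, payload in rows:
        run_branch(idx, "C", {**payload, "branch_a": a_results[idx]})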
Return JSON ONLY with the following schema:
{{
  "answer": "{letters}",
  "rationale": "concise 1-2 sentence justification (no fluff)",
Shall we ask for the final answer straight away if we are using a reasoning model and we can extract its reasoning chain?
Return JSON only:
{{"correct_answer":"{letters}",
"why_correct": "step-by-step reasoning showing why the gold option is correct",
"distractor_analysis": {distractor_tpl} }}
Could you help me understand distractor_analysis vs why_correct?
distractor_analysis explains all answer options; why_correct explains only the correct answer.
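So an illustrative example of the expected output would be (values invented for illustration):

example = {
    "correct_answer": "B",
    "why_correct": "Step-by-step reasoning for why option B follows from the question.",
    "distractor_analysis": {
        "A": "Plausible, but contradicts the premise given in the stem.",
        "B": "Correct: matches the definition asked about.",
        "C": "Confuses the concept with a related one.",
        "D": "Results from a common calculation mistake.",
    },
}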