Commit 27589fa

1 parent 17c9bd9 commit 27589fa

5 files changed: +468 / -370 lines changed

README.md

Lines changed: 43 additions & 0 deletions
@@ -88,3 +88,46 @@ python -m src.agentic_capability_generator
```bash
# Generate tasks for each capability
python -m src.agentic_task_generator
```

### Wikipedia-Based Analysis Tools

Tools for extracting, processing, and matching mathematical capabilities from Wikipedia. All prompts are centralized in `wikipedia/prompts.py`.

#### Wikipedia Glossary Scraper

Scrapes Wikipedia's "Glossary of areas of mathematics", extracts capability descriptions, and generates summaries with LLM-powered categorization.

```bash
cd wikipedia
python wikipedia_scraper.py
```

Outputs JSON files to `wikipedia/pages/` containing `capability_name`, `description`, `summary`, `area`, `url`, and `timestamp`.
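
One way to sanity-check the scraper output is to load each file and confirm that the documented fields are present. The sketch below assumes one JSON object per file and a `*.json` naming pattern, neither of which this README confirms; `load_capability_pages` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

# Field names documented above for each scraped page.
REQUIRED_FIELDS = ("capability_name", "description", "summary", "area", "url", "timestamp")


def load_capability_pages(pages_dir):
    """Load scraped records, skipping files that lack a documented field.

    Assumes one JSON object per ``*.json`` file (an assumption, not confirmed).
    """
    records = []
    for path in sorted(Path(pages_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        if not isinstance(record, dict):
            print(f"Skipping {path.name}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            print(f"Skipping {path.name}: missing {', '.join(missing)}")
            continue
        records.append(record)
    return records
```

Calling `load_capability_pages("wikipedia/pages")` would then return only the records that carry all six fields.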

#### Wikipedia-Generated Capability Matcher

Matches Wikipedia capabilities with generated capabilities using LLM-based similarity analysis. Supports bidirectional matching.

Configure `wikipedia/cfg/wiki_vs_generated.yaml`:
- `data_cfg.wikipedia_pages_dir`: Wikipedia pages directory
- `data_cfg.generated_dir`: Generated capabilities directory
- `processing_cfg.match_direction`: `generated_to_wikipedia` or `wikipedia_to_generated`

```bash
cd wikipedia
python wiki_vs_generated.py
```
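
Both matching prompts in `wikipedia/prompts.py` ask the model to answer with an exact capability name or "none", so a caller will likely want to validate the reply against the candidate list before recording a match. The following is a minimal sketch of that check, not the repository's implementation: `resolve_match` is a hypothetical name, and ignoring case, whitespace, and wrapping quotes is an assumption about how forgiving the comparison should be.

```python
def resolve_match(answer, candidate_names):
    """Map a raw model reply to a known capability name, or None.

    Returns None when the reply is "none" or does not equal any candidate.
    Comparison ignores case, surrounding whitespace, and wrapping quotes
    (an assumption, not behavior confirmed by the source).
    """
    cleaned = answer.strip().strip("\"'`").strip().casefold()
    if not cleaned or cleaned == "none":
        return None
    # Index candidates by their normalized form, keeping the original spelling.
    by_key = {name.strip().casefold(): name for name in candidate_names}
    return by_key.get(cleaned)


candidates = ["Field Theory", "Ring Theory"]
print(resolve_match(' "field theory" ', candidates))  # Field Theory
print(resolve_match("none", candidates))              # None
```

Rejecting anything that is not an exact candidate keeps a free-form or hallucinated reply from being counted as a match.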

#### Dataset Question Categorizer

Categorizes questions from GSM8K or MATH datasets into mathematical areas using generated or Wikipedia taxonomies. Supports checkpoint-based resume.

Configure `wikipedia/cfg/static_vs_generated.yaml`:
- `data_cfg.dataset_name`: `gsm8k` or `math`
- `data_cfg.dataset_path`: Dataset file (GSM8K) or directory (MATH)
- `categorization_cfg.extraction_method`: `generated` or `wikipedia`

```bash
cd wikipedia
python static_vs_generated.py
```
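
Checkpoint-based resume can be pictured as saving results after every question and skipping anything already saved on the next run. The sketch below illustrates that idea only; the checkpoint file format, the string question ids, and the function names are all assumptions, and the script's actual mechanism may differ.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path):
    """Return previously saved results keyed by question id, or {} if none."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def save_checkpoint(path, results):
    """Write to a temp file, then rename, so an interrupted run cannot corrupt the file."""
    path = Path(path)
    tmp_path = path.with_name(path.name + ".tmp")
    with tmp_path.open("w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
    os.replace(tmp_path, path)


def categorize_with_resume(questions, categorize, checkpoint_path):
    """Categorize only the questions that are missing from the checkpoint.

    `questions` maps a string id to question text; `categorize` is any
    callable returning an area name (for example, a wrapper around an LLM call).
    """
    results = load_checkpoint(checkpoint_path)
    for question_id, text in questions.items():
        if question_id in results:
            continue  # finished in an earlier run
        results[question_id] = categorize(text)
        save_checkpoint(checkpoint_path, results)
    return results
```

Saving after each question costs extra writes but means at most one categorization is lost when a run is interrupted.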

wikipedia/prompts.py

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
"""Centralized prompts for all Wikipedia-related scripts."""


# System prompts
SYSTEM_PROMPT_MATH_CAPABILITIES = "You are an expert in mathematical capabilities."
SYSTEM_PROMPT_MATH_TAXONOMIST = (
    "You are an expert mathematical taxonomist. "
    "Your task is to map a single math problem to the most appropriate area from a provided list. "
    "Output must be EXACTLY one of the given area names, or 'none' if no reasonable match exists. "
    "Do not include explanations or extra words."
)
SYSTEM_PROMPT_CAPABILITY_EVALUATION = """You are an expert in mathematics and capability evaluation. Your task is to create concise, informative summaries of mathematical concepts and capabilities.

Given a detailed description of a mathematical concept or capability, provide a clear, concise summary that captures the essential meaning and scope. The summary should be:
- Informative and accurate
- Concise ONLY ONE SENTENCE
- Written in clear, accessible language
- Focused on the core concept and its applications

Examples of good summaries:
- "Capability focusing on field theory, including solving problems related to field extensions, minimal polynomials, and degrees of extensions."
- "Capability that involves solving problems in ring theory including identification of ring properties and operations, testing the structure of rings."
- "Capability that asks the model to simplify algebraic expressions by reducing them to their simplest form. Involves collecting like terms and basic algebraic manipulations."

Provide only the summary, without any additional commentary or formatting."""
SYSTEM_PROMPT_CATEGORIZATION = """You are an expert in mathematics and capability evaluation. Your task is to categorize mathematical concepts and capabilities into one of 10 predefined mathematical areas.

Given a description of a mathematical concept or capability, determine which of the following 10 mathematical areas it best belongs to:

1. Algebra and Functions
2. Arithmetic and Number Theory
3. Calculus and Analysis
4. Differential Equations and Dynamical Systems
5. Discrete Mathematics and Combinatorics
6. Geometry and Spatial Reasoning
7. Linear Algebra and Matrix Theory
8. Mathematical Logic and Set Theory
9. Mathematical Modeling and Applications
10. Probability and Statistics

Return ONLY the exact area name from the list above, nothing else."""


# User prompts - functions that generate user prompts
def get_wikipedia_to_generated_prompt(wikipedia_cap_name: str, wikipedia_cap_description: str, capabilities_list: str) -> str:
    """Generate prompt for matching Wikipedia capability to generated capabilities."""
    return f"""You are an expert in mathematical capabilities. Determine which generated capability best matches the given Wikipedia capability.

Wikipedia Capability:
Name: {wikipedia_cap_name}
Description: {wikipedia_cap_description}

Available Generated Capabilities:
{capabilities_list}

Instructions:
- Compare the Wikipedia capability with each available capability.
- Return the exact capability name if ANY of the following is true:
* The Wikipedia capability and the available capability describe the same concept, OR
* The Wikipedia capability is a SUBSET/PART of the available capability (i.e., the available capability includes the Wikipedia capability as one of its components or subskills), OR
* The available capability is a broader superset that clearly contains the Wikipedia capability.
- Prefer the most specific matching capability when multiple candidates qualify.
- Return "none" only if no capability clearly contains or equals the Wikipedia capability.

Answer with only the capability name or "none":"""


def get_generated_to_wikipedia_prompt(generated_cap_name: str, generated_cap_description: str, capabilities_list: str) -> str:
    """Generate prompt for matching generated capability to Wikipedia capabilities."""
    return f"""You are an expert in mathematical capabilities. Find the Wikipedia capability that most closely matches the generated capability.

Generated Capability:
Name: {generated_cap_name}
Description: {generated_cap_description}

Available Wikipedia Capabilities:
{capabilities_list}

Instructions:
- Compare the generated capability with each available Wikipedia capability.
- Return the exact Wikipedia capability name if ANY of the following is true:
* The generated capability and the Wikipedia capability describe the same concept, OR
* The generated capability is a SUBSET/PART of the Wikipedia capability (i.e., the Wikipedia capability includes the generated capability as one of its components or subskills), OR
* The Wikipedia capability is a broader superset that clearly contains the generated capability.
- Prefer the most specific matching capability when multiple candidates qualify.
- Return "none" only if no Wikipedia capability clearly contains or equals the generated capability.

Answer with only the Wikipedia capability name or "none":"""


def get_area_categorization_prompt(area_bullets: str, question: str) -> str:
    """Generate prompt for categorizing a question into a mathematical area."""
    return f"""Available mathematical areas (choose exactly one):
{area_bullets}

Problem:
{question}

Instructions:
- Return ONLY the exact area name from the list above
- Prefer the closest match even if imperfect; avoid 'none' unless clearly unrelated
- Do not add punctuation or extra text

Answer:"""


def get_capability_summary_prompt(description: str) -> str:
    """Generate prompt for summarizing a mathematical capability."""
    return f"Please provide a concise summary of this mathematical concept:\n\n{description}"


def get_capability_categorization_prompt(description: str) -> str:
    """Generate prompt for categorizing a mathematical capability."""
    return f"Please categorize this mathematical concept into one of the 10 areas listed above:\n\n{description}"
