Description
The current reward function in Marco-o1's MCTS implementation relies solely on token-level confidence scores derived from the model's output probabilities. While this method provides a straightforward way to evaluate reasoning paths, it has notable limitations:
- **Local Optimality:** Token-level probabilities may lead to paths that seem promising locally but fail to achieve global correctness.
- **Model Bias:** The model's inherent biases might result in overconfidence in certain common patterns, misguiding the search process.
- **Context Insensitivity:** The reward function does not evaluate the logical consistency of the tokens in the broader context of the reasoning path.
- **Lack of Task-Specificity:** The reward function is generic and does not incorporate domain-specific knowledge or logical rules pertinent to the task.
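To make the critique concrete, here is a minimal sketch of what a confidence-based reward of this kind typically looks like: each chosen token's log-probability is softmaxed against the top alternative candidates at that step, and the per-token confidences are averaged into a path reward. The function names and the exact candidate set are assumptions for illustration, not Marco-o1's actual implementation.

```python
import math

def token_confidence(chosen_logprob, alt_logprobs):
    """Confidence of the chosen token: its softmax share among
    itself and the top alternative candidates at that step.
    (Hypothetical helper; the real candidate set may differ.)"""
    exps = [math.exp(lp) for lp in [chosen_logprob] + list(alt_logprobs)]
    return exps[0] / sum(exps)

def path_reward(step_logprobs):
    """Average per-token confidence over a reasoning path.
    step_logprobs: list of (chosen_logprob, [alt_logprobs]) pairs."""
    confs = [token_confidence(c, alts) for c, alts in step_logprobs]
    return sum(confs) / len(confs)

# A path whose chosen tokens dominate their alternatives scores high,
# regardless of whether the reasoning is globally or logically correct --
# which is exactly the limitation described above.
confident_path = [(-0.1, [-3.0, -4.0]), (-0.2, [-2.5, -3.5])]
hesitant_path = [(-1.0, [-1.1, -1.2]), (-0.9, [-1.0, -1.1])]
```

Note that `path_reward` sees only probabilities: two paths with identical token confidences receive identical rewards even if one is logically inconsistent with the task, which is the context-insensitivity and task-specificity gap described above.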