Enhancing Reward Function for MCTS in Marco-o1 #13

@johnhaofu

Description

The current reward function in Marco-o1's MCTS implementation relies solely on token-level confidence scores derived from the model's output probabilities. While this method provides a straightforward way to evaluate reasoning paths, it has notable limitations:

- **Local Optimality:** Token-level probabilities may favor paths that seem promising locally but fail to achieve global correctness.
- **Model Bias:** The model's inherent biases might result in overconfidence in certain common patterns, misguiding the search process.
- **Context Insensitivity:** The reward function does not evaluate the logical consistency of tokens in the broader context of the reasoning path.
- **Lack of Task-Specificity:** The reward function is generic and does not incorporate domain-specific knowledge or logical rules pertinent to the task.
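For concreteness, the confidence-based reward described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the per-token confidence is the chosen token's probability normalized over the top-k candidate tokens (softmax over their log-probs), and that the path-level reward is the mean of these confidences. All function and parameter names here are hypothetical.

```python
import math

def token_confidence(chosen_logprob, candidate_logprobs):
    """Confidence of one generated token.

    candidate_logprobs: log-probs of the top-k candidate tokens at this
    step (the chosen token's log-prob is included among them).
    Returns exp(chosen) normalized over the candidates, i.e. a softmax
    weight in (0, 1].
    """
    denom = sum(math.exp(lp) for lp in candidate_logprobs)
    return math.exp(chosen_logprob) / denom

def path_reward(chosen_logprobs, topk_logprobs_per_step):
    """Average per-token confidence along a reasoning path.

    This scalar is what the MCTS backup would use as the node value
    under the purely confidence-based scheme discussed in this issue.
    """
    scores = [
        token_confidence(lp, cands)
        for lp, cands in zip(chosen_logprobs, topk_logprobs_per_step)
    ]
    return sum(scores) / len(scores)

# Example: two steps, each with the chosen token at probability 0.5
# among two equally likely top-2 candidates -> reward 0.5.
half = math.log(0.5)
reward = path_reward([half, half], [[half, half], [half, half]])
```

Because this reward is built entirely from local output probabilities, it inherits every limitation listed above: nothing in the computation inspects global correctness, counteracts the model's own biases, or injects task-specific rules.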
