
LFM2 tool calling: parser and constrainer fixes, BFCL simple bench#595

Closed
ncylich wants to merge 4 commits into main from lfm-tool-fixes

Conversation


@ncylich ncylich commented Apr 17, 2026

Summary

  • Parser fixes in cactus/ffi/cactus_utils.h so LFM2 Python-style tool calls round-trip to JSON correctly: drop a phantom empty-name tool when a schema has a property literally named function, bracket-match unquoted values so interval=[1,3] isn't cut to [1, and preserve the original quoted-ness so radius=4 stays a number instead of becoming the string "4".
  • Constrainer fix for LFM2's LFM_IN_FUNC_NAME state: paren_open_tokens_ was populated only by encode("("), missing BPE-merged tokens like () and (. Added a reusable add_tokens_for_prefix_string helper that scans the vocab for every token whose decoded form starts with a given prefix, and use it to round out the set so no paren-carrying token can slip through while the name is still being emitted.
  • Minimal single-file BFCL benchmark at tests/test_bfcl_simple.cpp (self-contained — only includes cactus_ffi.h and std headers). Embeds 20 simple_python cases, runs them through cactus_complete, and reports pass rate. Reads model path from CACTUS_TEST_MODEL.
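The bracket-matching fix described above can be sketched in isolation. This is a hypothetical standalone function, not the actual cactus_utils.h code: an unquoted value runs until a top-level comma or `)`, with `[`/`]` and `{`/`}` depth tracked so `interval=[1,3]` is captured whole instead of being cut at the inner comma.

```cpp
#include <cassert>
#include <string>

// Sketch only: scan an unquoted argument value starting at `pos`, advancing
// `pos` past the value. Container brackets are depth-matched so commas inside
// lists/dicts do not terminate the value.
std::string scan_unquoted_value(const std::string& s, size_t& pos) {
    size_t start = pos;
    int depth = 0;
    while (pos < s.size()) {
        char c = s[pos];
        if (c == '[' || c == '{') depth++;
        else if (c == ']' || c == '}') depth--;
        else if ((c == ',' || c == ')') && depth == 0) break;  // end of value
        pos++;
    }
    return s.substr(start, pos - start);
}
```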

Bench numbers

| Model | Before | After |
| --- | --- | --- |
| LFM2.5 350m simple_python | 0/20 | 16/20 |
| Gemma 4 E2B simple_python | 19/20 | 19/20 (unchanged) |

Test plan

  • CACTUS_TEST_MODEL=weights/lfm2.5-350m tests/build/test_bfcl_simple → 15–16/20 pass
  • CACTUS_TEST_MODEL=weights/gemma-4-e2b-it tests/build/test_bfcl_simple → 19/20 pass
  • Existing test_gemma4_tools_bench regressions: none (Gemma 4 numbers unchanged)

ncylich added 4 commits April 17, 2026 11:51
Three small bugs in cactus_utils.h surfaced while running BFCL
simple_python against LFM2.5-350m (0/20 -> 11/20):

1. parse_tools_json pushed a phantom empty-name tool whenever an OpenAI
   schema used "function" as a property name inside parameters.properties
   (common for math / calculus tools). Guard the push with
   !tool.name.empty().

2. append_lfm2_call stopped an unquoted value at the first comma, which
   broke list / dict arguments like interval=[1,3] (parsed as just
   [1). Bracket-match instead.

3. append_lfm2_call always re-wrapped values in quotes, turning
   radius=4 into "radius":"4" and failing BFCL's type-strict integer
   check. Preserve the original quoted-ness: quoted values stay strings,
   unquoted values (numbers, bools, containers) pass through as-is.

Gemma 4 tool-calling pass rates unchanged (19/20 simple_python, 7/20
live_multiple).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Once the parser no longer hands the constrainer an empty-name phantom
(previous commit), LFM_IN_FUNC_NAME reliably waits for the full
function name, but `(` can still slip in because `paren_open_tokens_`
was populated only by `add_tokens_for_string("(")`, which returns the
single BPE token for `(` in isolation. Merged tokens like `()` and
` (` go through unscathed, and the model occasionally picks one of
those right after a partial name, ending the name early.

Fix: add an `add_tokens_for_prefix_string` helper that scans the vocab
and adds every token whose decoded form starts with the given prefix.
Call it once for `(` to round out `paren_open_tokens_`, and add that
set to the block list in `LFM_IN_FUNC_NAME` so the state actually
prevents all paren-introducing tokens, not just the standalone `(`.

Scoped to the LFM2 branch of `tokenize_grammar_elements`; other model
families are untouched.

LFM2.5-350m simple_python: 11/20 -> 16/20. live_multiple / Gemma 4
unchanged.
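As a rough standalone illustration of the helper (operating on a pre-decoded vocab rather than the tokenizer, so the free-function form here is hypothetical; the real helper lives on ToolCallConstrainer and calls tokenizer_->decode()):

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: collect every token id whose decoded form starts with `prefix`.
// This is what lets "()" and "(bar"-style BPE merges land in the same set
// as the standalone "(" token.
std::unordered_set<uint32_t> tokens_with_prefix(
        const std::vector<std::string>& decoded_vocab,
        const std::string& prefix) {
    std::unordered_set<uint32_t> out;
    for (uint32_t t = 0; t < decoded_vocab.size(); t++) {
        // rfind(prefix, 0) == 0 is the idiomatic starts_with before C++20.
        if (decoded_vocab[t].rfind(prefix, 0) == 0) out.insert(t);
    }
    return out;
}
```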

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Copilot AI review requested due to automatic review settings April 17, 2026 18:52

Copilot AI left a comment


Pull request overview

This PR improves LFM2 tool-calling reliability across the stack (schema parsing → tool-call parsing → token-level constraints) and adds a minimal BFCL “simple_python” benchmark executable to quantify tool-call correctness.

Changes:

  • Fix tool schema parsing and LFM2 Python-style tool-call argument parsing to round-trip into valid JSON more accurately.
  • Expand LFM2 tool-call constraining to account for BPE-merged tokens related to parentheses during function-name emission.
  • Add a self-contained BFCL “simple_python” benchmark runner under tests/ that executes 20 embedded cases and reports pass rate.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_bfcl_simple.cpp | Adds a single-file BFCL “simple_python” benchmark executable that runs 20 cases via cactus_complete. |
| cactus/ffi/cactus_utils.h | Adjusts tool schema parsing to skip empty-name phantom tools; improves LFM2 tool-call arg parsing (bracket matching + preserve quoted-ness). |
| cactus/engine/engine_constraints.cpp | Adds a vocab scan helper for prefix-matching tokens and uses it to strengthen LFM2 paren token sets; blocks paren-open tokens in LFM_IN_FUNC_NAME. |
| cactus/engine/engine.h | Declares the new ToolCallConstrainer::add_tokens_for_prefix_string helper. |


Comment on lines +416 to +419
std::printf("%s\nPass rate: %zu/%zu (%.1f%%)\n",
std::string(60, '-').c_str(), passed, case_count,
100.0 * static_cast<double>(passed) / static_cast<double>(case_count));
return passed == case_count ? 0 : 1;

Copilot AI Apr 17, 2026


This executable exits non-zero unless all cases pass (return passed == case_count ? 0 : 1). Since tests/CMakeLists.txt builds every test_*.cpp and tests/run.sh runs every test_* binary by default with CACTUS_TEST_MODEL set, this will make the default test run fail for models that aren’t 20/20 (the PR description reports 16/20 and 19/20). Consider either (a) renaming this to a non-test_* target (bench) so it’s opt-in, or (b) gating with an env var / using a threshold-based exit code.


tokenize_function_names(false);
tokenize_function_names(false);
add_tokens_for_prefix_string("(", paren_open_tokens_);

Copilot AI Apr 17, 2026


add_tokens_for_prefix_string("(", ...) only matches tokens whose decoded form starts with '('. Tokens like ' (' (leading space) or other merged forms where '(' is not at position 0 won’t be included, so they can still slip through during LFM_IN_FUNC_NAME despite the intent described in the PR. If the goal is “block any token that contains '(' while emitting the name”, consider scanning for tokens whose decoded form contains '(' (or adding prefix variants for the tokenizer’s whitespace marker, e.g. " (" / "▁(" / "Ġ(").

Suggested change
add_tokens_for_prefix_string("(", paren_open_tokens_);
add_tokens_for_prefix_string("(", paren_open_tokens_);
add_tokens_for_prefix_string(" (", paren_open_tokens_);
add_tokens_for_prefix_string("▁(", paren_open_tokens_);
add_tokens_for_prefix_string("Ġ(", paren_open_tokens_);

Comment on lines +17 to +21
void ToolCallConstrainer::add_tokens_for_prefix_string(const std::string& prefix, std::unordered_set<uint32_t>& token_set) {
if (!tokenizer_) return;
const uint32_t vocab = tokenizer_->get_vocab_size();
for (uint32_t t = 0; t < vocab; t++) {
if (tokenizer_->decode({t}).rfind(prefix, 0) == 0) {

Copilot AI Apr 17, 2026


add_tokens_for_prefix_string() linearly scans the entire vocab and calls tokenizer_->decode({t}) for every token. ToolCallConstrainer::init() is invoked from Model::set_tool_constraints() per request, so this can add significant latency when tool constraints are enabled. Consider caching decoded token strings / precomputing the prefix->token_set once per tokenizer (or exposing a zero-allocation vocab string accessor) so repeated calls don’t rescan+decode the full vocab.
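One way the caching idea could look, as a minimal sketch (all names here are hypothetical, not existing cactus APIs): decode each token once into a per-tokenizer cache, then answer prefix queries against the cached strings instead of re-decoding the full vocab on every init().

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: `decode_one` stands in for tokenizer_->decode({t}). build() is
// idempotent, so repeated constrainer inits pay the decode cost only once.
struct DecodedVocabCache {
    std::vector<std::string> decoded;

    void build(uint32_t vocab_size,
               const std::function<std::string(uint32_t)>& decode_one) {
        if (!decoded.empty()) return;  // already built for this tokenizer
        decoded.reserve(vocab_size);
        for (uint32_t t = 0; t < vocab_size; t++) decoded.push_back(decode_one(t));
    }

    std::unordered_set<uint32_t> with_prefix(const std::string& prefix) const {
        std::unordered_set<uint32_t> out;
        for (uint32_t t = 0; t < decoded.size(); t++)
            if (decoded[t].rfind(prefix, 0) == 0) out.insert(t);
        return out;
    }
};
```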


tokenize_function_names(false);
tokenize_function_names(false);
add_tokens_for_prefix_string("(", paren_open_tokens_);

Copilot AI Apr 17, 2026


add_tokens_for_prefix_string("(", paren_open_tokens_) will add all tokens whose decoded text starts with "(", including multi-character merges like "()" or "(foo". In LFM_EXPECT_PAREN you later FORCE_BIAS every token in paren_open_tokens_, which can cause the model to emit a merged token like "()" and skip the ) state transition (since LFM_EXPECT_PAREN only checks for '('). Consider keeping a separate “exact open-paren tokens” set for forcing, and use a broader set only for blocking during LFM_IN_FUNC_NAME (or explicitly filter out tokens whose decoded form contains ')').

Suggested change
add_tokens_for_prefix_string("(", paren_open_tokens_);

Comment thread cactus/ffi/cactus_utils.h
Comment on lines 1460 to +1462
if (!first_arg) json_call += ",";
json_call += "\"" + arg_name + "\":\"" + arg_value + "\"";
json_call += "\"" + arg_name + "\":";
json_call += quoted ? ("\"" + arg_value + "\"") : arg_value;

Copilot AI Apr 17, 2026


When quoted == false, arg_value is appended verbatim into JSON. This is correct for JSON primitives/arrays/objects, but will produce invalid JSON if the model emits an unquoted string identifier (e.g. unit=inches -> "unit":inches). Consider validating arg_value as a JSON literal (number/true/false/null/{/[) and otherwise quoting+escaping it so append_lfm2_call() always produces valid JSON.
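A sketch of the suggested guard (helper name hypothetical): accept a bare value only when it parses as a JSON number, true/false/null, or opens a container; anything else should be quoted and escaped before emission.

```cpp
#include <cctype>
#include <string>

// Sketch: conservative check for "safe to emit unquoted into JSON".
// Numbers are validated loosely (digits, '.', exponent chars); containers are
// accepted by their opening bracket, leaving deeper validation to the parser.
bool is_json_literal(const std::string& v) {
    if (v.empty()) return false;
    if (v == "true" || v == "false" || v == "null") return true;
    if (v[0] == '[' || v[0] == '{') return true;  // container
    char c = v[0];
    if (c == '-' && v.size() == 1) return false;  // lone minus is not a number
    if (c == '-' || std::isdigit(static_cast<unsigned char>(c))) {
        for (size_t i = 1; i < v.size(); i++) {
            char d = v[i];
            if (!std::isdigit(static_cast<unsigned char>(d)) && d != '.' &&
                d != 'e' && d != 'E' && d != '+' && d != '-')
                return false;  // e.g. "4inches" must be quoted
        }
        return true;
    }
    return false;  // bare identifier like `inches` -> quote it
}
```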

@ncylich ncylich closed this Apr 20, 2026