
LFM2 tool calling: parser and constrainer fixes, BFCL simple bench#595

Closed
ncylich wants to merge 4 commits into main from lfm-tool-fixes

Conversation


@ncylich ncylich commented Apr 17, 2026

Summary

  • Parser fixes in cactus/ffi/cactus_utils.h so LFM2 Python-style tool calls round-trip to JSON correctly: drop a phantom empty-name tool when a schema has a property literally named function, bracket-match unquoted values so interval=[1,3] isn't cut to [1, and preserve the original quoted-ness so radius=4 stays a number instead of becoming the string "4".
  • Constrainer fix for LFM2's LFM_IN_FUNC_NAME state: paren_open_tokens_ was populated only by encode("("), missing BPE-merged tokens like () and (. Added a reusable add_tokens_for_prefix_string helper that scans the vocab for every token whose decoded form starts with a given prefix, and use it to round out the set so no paren-carrying token can slip through while the name is still being emitted.
  • Minimal single-file BFCL benchmark at tests/test_bfcl_simple.cpp (self-contained — only includes cactus_ffi.h and std headers). Embeds 20 simple_python cases, runs them through cactus_complete, and reports pass rate. Reads model path from CACTUS_TEST_MODEL.
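The bracket-matching fix described above can be sketched in isolation. This is a hypothetical standalone function, not the actual cactus_utils.h code: an unquoted value runs until a top-level comma or `)`, with `[`/`]` and `{`/`}` depth tracked so `interval=[1,3]` is captured whole instead of being cut at the inner comma.

```cpp
#include <cassert>
#include <string>

// Sketch only: scan an unquoted argument value starting at `pos`, advancing
// `pos` past the value. Container brackets are depth-matched so commas inside
// lists/dicts do not terminate the value.
std::string scan_unquoted_value(const std::string& s, size_t& pos) {
    size_t start = pos;
    int depth = 0;
    while (pos < s.size()) {
        char c = s[pos];
        if (c == '[' || c == '{') depth++;
        else if (c == ']' || c == '}') depth--;
        else if ((c == ',' || c == ')') && depth == 0) break;  // end of value
        pos++;
    }
    return s.substr(start, pos - start);
}
```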

Bench numbers

| Model | Before | After |
| --- | --- | --- |
| LFM2.5 350m simple_python | 0/20 | 16/20 |
| Gemma 4 E2B simple_python | 19/20 | 19/20 (unchanged) |

Test plan

  • CACTUS_TEST_MODEL=weights/lfm2.5-350m tests/build/test_bfcl_simple → 15–16/20 pass
  • CACTUS_TEST_MODEL=weights/gemma-4-e2b-it tests/build/test_bfcl_simple → 19/20 pass
  • Existing test_gemma4_tools_bench regressions: none (Gemma 4 numbers unchanged)

ncylich added 4 commits April 17, 2026 11:51
Three small bugs in cactus_utils.h surfaced while running BFCL
simple_python against LFM2.5-350m (0/20 -> 11/20):

1. parse_tools_json pushed a phantom empty-name tool whenever an OpenAI
   schema used "function" as a property name inside parameters.properties
   (common for math / calculus tools). Guard the push with
   !tool.name.empty().

2. append_lfm2_call stopped an unquoted value at the first comma, which
   broke list / dict arguments like interval=[1,3] (parsed as just
   [1). Bracket-match instead.

3. append_lfm2_call always re-wrapped values in quotes, turning
   radius=4 into "radius":"4" and failing BFCL's type-strict integer
   check. Preserve the original quoted-ness: quoted values stay strings,
   unquoted values (numbers, bools, containers) pass through as-is.

Gemma 4 tool-calling pass rates unchanged (19/20 simple_python, 7/20
live_multiple).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Once the parser no longer hands the constrainer an empty-name phantom
(previous commit), LFM_IN_FUNC_NAME reliably waits for the full
function name, but `(` can still slip in because `paren_open_tokens_`
was populated only by `add_tokens_for_string("(")`, which returns the
single BPE token for `(` in isolation. Merged tokens like `()` and
` (` go through unscathed, and the model occasionally picks one of
those right after a partial name, ending the name early.

Fix: add an `add_tokens_for_prefix_string` helper that scans the vocab
and adds every token whose decoded form starts with the given prefix.
Call it once for `(` to round out `paren_open_tokens_`, and add that
set to the block list in `LFM_IN_FUNC_NAME` so the state actually
prevents all paren-introducing tokens, not just the standalone `(`.

Scoped to the LFM2 branch of `tokenize_grammar_elements`; other model
families are untouched.

LFM2.5-350m simple_python: 11/20 -> 16/20. live_multiple / Gemma 4
unchanged.
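As a rough standalone illustration of the helper (operating on a pre-decoded vocab rather than the tokenizer, so the free-function form here is hypothetical; the real helper lives on ToolCallConstrainer and calls tokenizer_->decode()):

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: collect every token id whose decoded form starts with `prefix`.
// This is what lets "()" and "(bar"-style BPE merges land in the same set
// as the standalone "(" token.
std::unordered_set<uint32_t> tokens_with_prefix(
        const std::vector<std::string>& decoded_vocab,
        const std::string& prefix) {
    std::unordered_set<uint32_t> out;
    for (uint32_t t = 0; t < decoded_vocab.size(); t++) {
        // rfind(prefix, 0) == 0 is the idiomatic starts_with before C++20.
        if (decoded_vocab[t].rfind(prefix, 0) == 0) out.insert(t);
    }
    return out;
}
```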

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Copilot AI review requested due to automatic review settings April 17, 2026 18:52

Copilot AI left a comment


Pull request overview

This PR improves LFM2 tool-calling reliability across the stack (schema parsing → tool-call parsing → token-level constraints) and adds a minimal BFCL “simple_python” benchmark executable to quantify tool-call correctness.

Changes:

  • Fix tool schema parsing and LFM2 Python-style tool-call argument parsing to round-trip into valid JSON more accurately.
  • Expand LFM2 tool-call constraining to account for BPE-merged tokens related to parentheses during function-name emission.
  • Add a self-contained BFCL “simple_python” benchmark runner under tests/ that executes 20 embedded cases and reports pass rate.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_bfcl_simple.cpp | Adds a single-file BFCL “simple_python” benchmark executable that runs 20 cases via cactus_complete. |
| cactus/ffi/cactus_utils.h | Adjusts tool schema parsing to skip empty-name phantom tools; improves LFM2 tool-call arg parsing (bracket matching + preserve quoted-ness). |
| cactus/engine/engine_constraints.cpp | Adds a vocab scan helper for prefix-matching tokens and uses it to strengthen LFM2 paren token sets; blocks paren-open tokens in LFM_IN_FUNC_NAME. |
| cactus/engine/engine.h | Declares the new ToolCallConstrainer::add_tokens_for_prefix_string helper. |


Comment on lines +416 to +419
std::printf("%s\nPass rate: %zu/%zu (%.1f%%)\n",
std::string(60, '-').c_str(), passed, case_count,
100.0 * static_cast<double>(passed) / static_cast<double>(case_count));
return passed == case_count ? 0 : 1;

Copilot AI Apr 17, 2026


This executable exits non-zero unless all cases pass (return passed == case_count ? 0 : 1). Since tests/CMakeLists.txt builds every test_*.cpp and tests/run.sh runs every test_* binary by default with CACTUS_TEST_MODEL set, this will make the default test run fail for models that aren’t 20/20 (the PR description reports 16/20 and 19/20). Consider either (a) renaming this to a non-test_* target (bench) so it’s opt-in, or (b) gating with an env var / using a threshold-based exit code.


tokenize_function_names(false);
tokenize_function_names(false);
add_tokens_for_prefix_string("(", paren_open_tokens_);

Copilot AI Apr 17, 2026


add_tokens_for_prefix_string("(", ...) only matches tokens whose decoded form starts with '('. Tokens like ' (' (leading space) or other merged forms where '(' is not at position 0 won’t be included, so they can still slip through during LFM_IN_FUNC_NAME despite the intent described in the PR. If the goal is “block any token that contains '(' while emitting the name”, consider scanning for tokens whose decoded form contains '(' (or adding prefix variants for the tokenizer’s whitespace marker, e.g. " (" / "▁(" / "Ġ(").

Suggested change
add_tokens_for_prefix_string("(", paren_open_tokens_);
add_tokens_for_prefix_string("(", paren_open_tokens_);
add_tokens_for_prefix_string(" (", paren_open_tokens_);
add_tokens_for_prefix_string("▁(", paren_open_tokens_);
add_tokens_for_prefix_string("Ġ(", paren_open_tokens_);

Comment on lines +17 to +21
void ToolCallConstrainer::add_tokens_for_prefix_string(const std::string& prefix, std::unordered_set<uint32_t>& token_set) {
if (!tokenizer_) return;
const uint32_t vocab = tokenizer_->get_vocab_size();
for (uint32_t t = 0; t < vocab; t++) {
if (tokenizer_->decode({t}).rfind(prefix, 0) == 0) {

Copilot AI Apr 17, 2026


add_tokens_for_prefix_string() linearly scans the entire vocab and calls tokenizer_->decode({t}) for every token. ToolCallConstrainer::init() is invoked from Model::set_tool_constraints() per request, so this can add significant latency when tool constraints are enabled. Consider caching decoded token strings / precomputing the prefix->token_set once per tokenizer (or exposing a zero-allocation vocab string accessor) so repeated calls don’t rescan+decode the full vocab.
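One way the caching idea could look, as a minimal sketch (all names here are hypothetical, not existing cactus APIs): decode each token once into a per-tokenizer cache, then answer prefix queries against the cached strings instead of re-decoding the full vocab on every init().

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: `decode_one` stands in for tokenizer_->decode({t}). build() is
// idempotent, so repeated constrainer inits pay the decode cost only once.
struct DecodedVocabCache {
    std::vector<std::string> decoded;

    void build(uint32_t vocab_size,
               const std::function<std::string(uint32_t)>& decode_one) {
        if (!decoded.empty()) return;  // already built for this tokenizer
        decoded.reserve(vocab_size);
        for (uint32_t t = 0; t < vocab_size; t++) decoded.push_back(decode_one(t));
    }

    std::unordered_set<uint32_t> with_prefix(const std::string& prefix) const {
        std::unordered_set<uint32_t> out;
        for (uint32_t t = 0; t < decoded.size(); t++)
            if (decoded[t].rfind(prefix, 0) == 0) out.insert(t);
        return out;
    }
};
```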


tokenize_function_names(false);
tokenize_function_names(false);
add_tokens_for_prefix_string("(", paren_open_tokens_);

Copilot AI Apr 17, 2026


add_tokens_for_prefix_string("(", paren_open_tokens_) will add all tokens whose decoded text starts with "(", including multi-character merges like "()" or "(foo". In LFM_EXPECT_PAREN you later FORCE_BIAS every token in paren_open_tokens_, which can cause the model to emit a merged token like "()" and skip the ) state transition (since LFM_EXPECT_PAREN only checks for '('). Consider keeping a separate “exact open-paren tokens” set for forcing, and use a broader set only for blocking during LFM_IN_FUNC_NAME (or explicitly filter out tokens whose decoded form contains ')').

Suggested change
add_tokens_for_prefix_string("(", paren_open_tokens_);

Comment thread cactus/ffi/cactus_utils.h
Comment on lines 1460 to +1462
if (!first_arg) json_call += ",";
json_call += "\"" + arg_name + "\":\"" + arg_value + "\"";
json_call += "\"" + arg_name + "\":";
json_call += quoted ? ("\"" + arg_value + "\"") : arg_value;

Copilot AI Apr 17, 2026


When quoted == false, arg_value is appended verbatim into JSON. This is correct for JSON primitives/arrays/objects, but will produce invalid JSON if the model emits an unquoted string identifier (e.g. unit=inches -> "unit":inches). Consider validating arg_value as a JSON literal (number/true/false/null/{/[) and otherwise quoting+escaping it so append_lfm2_call() always produces valid JSON.
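A sketch of the suggested guard (helper name hypothetical): accept a bare value only when it parses as a JSON number, true/false/null, or opens a container; anything else should be quoted and escaped before emission.

```cpp
#include <cctype>
#include <string>

// Sketch: conservative check for "safe to emit unquoted into JSON".
// Numbers are validated loosely (digits, '.', exponent chars); containers are
// accepted by their opening bracket, leaving deeper validation to the parser.
bool is_json_literal(const std::string& v) {
    if (v.empty()) return false;
    if (v == "true" || v == "false" || v == "null") return true;
    if (v[0] == '[' || v[0] == '{') return true;  // container
    char c = v[0];
    if (c == '-' && v.size() == 1) return false;  // lone minus is not a number
    if (c == '-' || std::isdigit(static_cast<unsigned char>(c))) {
        for (size_t i = 1; i < v.size(); i++) {
            char d = v[i];
            if (!std::isdigit(static_cast<unsigned char>(d)) && d != '.' &&
                d != 'e' && d != 'E' && d != '+' && d != '-')
                return false;  // e.g. "4inches" must be quoted
        }
        return true;
    }
    return false;  // bare identifier like `inches` -> quote it
}
```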

@ncylich ncylich closed this Apr 20, 2026