LFM2 tool calling: parser and constrainer fixes, BFCL simple bench #595
Conversation
Three small bugs in `cactus_utils.h` surfaced while running BFCL simple_python against LFM2.5-350m (0/20 -> 11/20):

1. `parse_tools_json` pushed a phantom empty-name tool whenever an OpenAI schema used `"function"` as a property name inside `parameters.properties` (common for math / calculus tools). Guard the push with `!tool.name.empty()`.
2. `append_lfm2_call` stopped an unquoted value at the first comma, which broke list / dict arguments like `interval=[1,3]` (parsed as just `[1`). Bracket-match instead.
3. `append_lfm2_call` always re-wrapped values in quotes, turning `radius=4` into `"radius":"4"` and failing BFCL's type-strict integer check. Preserve the original quoted-ness: quoted values stay strings, unquoted values (numbers, bools, containers) pass through as-is.

Gemma 4 tool-calling pass rates unchanged (19/20 simple_python, 7/20 live_multiple).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Once the parser no longer hands the constrainer an empty-name phantom
(previous commit), LFM_IN_FUNC_NAME reliably waits for the full
function name, but `(` can still slip in because `paren_open_tokens_`
was populated only by `add_tokens_for_string("(")`, which returns the
single BPE token for `(` in isolation. Merged tokens like `()` and
` (` go through unscathed, and the model occasionally picks one of
those right after a partial name, ending the name early.
Fix: add an `add_tokens_for_prefix_string` helper that scans the vocab
and adds every token whose decoded form starts with the given prefix.
Call it once for `(` to round out `paren_open_tokens_`, and add that
set to the block list in `LFM_IN_FUNC_NAME` so the state actually
prevents all paren-introducing tokens, not just the standalone `(`.
Scoped to the LFM2 branch of `tokenize_grammar_elements`; other model
families are untouched.
LFM2.5-350m simple_python: 11/20 -> 16/20. live_multiple / Gemma 4
unchanged.
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Pull request overview
This PR improves LFM2 tool-calling reliability across the stack (schema parsing → tool-call parsing → token-level constraints) and adds a minimal BFCL “simple_python” benchmark executable to quantify tool-call correctness.
Changes:
- Fix tool schema parsing and LFM2 Python-style tool-call argument parsing to round-trip into valid JSON more accurately.
- Expand LFM2 tool-call constraining to account for BPE-merged tokens related to parentheses during function-name emission.
- Add a self-contained BFCL “simple_python” benchmark runner under `tests/` that executes 20 embedded cases and reports pass rate.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_bfcl_simple.cpp | Adds a single-file BFCL “simple_python” benchmark executable that runs 20 cases via cactus_complete. |
| cactus/ffi/cactus_utils.h | Adjusts tool schema parsing to skip empty-name phantom tools; improves LFM2 tool-call arg parsing (bracket matching + preserve quoted-ness). |
| cactus/engine/engine_constraints.cpp | Adds a vocab scan helper for prefix-matching tokens and uses it to strengthen LFM2 paren token sets; blocks paren-open tokens in LFM_IN_FUNC_NAME. |
| cactus/engine/engine.h | Declares the new ToolCallConstrainer::add_tokens_for_prefix_string helper. |
```cpp
std::printf("%s\nPass rate: %zu/%zu (%.1f%%)\n",
            std::string(60, '-').c_str(), passed, case_count,
            100.0 * static_cast<double>(passed) / static_cast<double>(case_count));
return passed == case_count ? 0 : 1;
```
This executable exits non-zero unless all cases pass (return passed == case_count ? 0 : 1). Since tests/CMakeLists.txt builds every test_*.cpp and tests/run.sh runs every test_* binary by default with CACTUS_TEST_MODEL set, this will make the default test run fail for models that aren’t 20/20 (the PR description reports 16/20 and 19/20). Consider either (a) renaming this to a non-test_* target (bench) so it’s opt-in, or (b) gating with an env var / using a threshold-based exit code.
```cpp
tokenize_function_names(false);
add_tokens_for_prefix_string("(", paren_open_tokens_);
```
add_tokens_for_prefix_string("(", ...) only matches tokens whose decoded form starts with '('. Tokens like ' (' (leading space) or other merged forms where '(' is not at position 0 won’t be included, so they can still slip through during LFM_IN_FUNC_NAME despite the intent described in the PR. If the goal is “block any token that contains '(' while emitting the name”, consider scanning for tokens whose decoded form contains '(' (or adding prefix variants for the tokenizer’s whitespace marker, e.g. " (" / "▁(" / "Ġ(").
Suggested change:

```cpp
add_tokens_for_prefix_string("(", paren_open_tokens_);
add_tokens_for_prefix_string(" (", paren_open_tokens_);
add_tokens_for_prefix_string("▁(", paren_open_tokens_);
add_tokens_for_prefix_string("Ġ(", paren_open_tokens_);
```
```cpp
void ToolCallConstrainer::add_tokens_for_prefix_string(const std::string& prefix, std::unordered_set<uint32_t>& token_set) {
    if (!tokenizer_) return;
    const uint32_t vocab = tokenizer_->get_vocab_size();
    for (uint32_t t = 0; t < vocab; t++) {
        if (tokenizer_->decode({t}).rfind(prefix, 0) == 0) {
```
add_tokens_for_prefix_string() linearly scans the entire vocab and calls tokenizer_->decode({t}) for every token. ToolCallConstrainer::init() is invoked from Model::set_tool_constraints() per request, so this can add significant latency when tool constraints are enabled. Consider caching decoded token strings / precomputing the prefix->token_set once per tokenizer (or exposing a zero-allocation vocab string accessor) so repeated calls don’t rescan+decode the full vocab.
```cpp
tokenize_function_names(false);
add_tokens_for_prefix_string("(", paren_open_tokens_);
```
add_tokens_for_prefix_string("(", paren_open_tokens_) will add all tokens whose decoded text starts with "(", including multi-character merges like "()" or "(foo". In LFM_EXPECT_PAREN you later FORCE_BIAS every token in paren_open_tokens_, which can cause the model to emit a merged token like "()" and skip the ) state transition (since LFM_EXPECT_PAREN only checks for '('). Consider keeping a separate “exact open-paren tokens” set for forcing, and use a broader set only for blocking during LFM_IN_FUNC_NAME (or explicitly filter out tokens whose decoded form contains ')').
```cpp
add_tokens_for_prefix_string("(", paren_open_tokens_);
```
```diff
 if (!first_arg) json_call += ",";
-json_call += "\"" + arg_name + "\":\"" + arg_value + "\"";
+json_call += "\"" + arg_name + "\":";
+json_call += quoted ? ("\"" + arg_value + "\"") : arg_value;
```
When quoted == false, arg_value is appended verbatim into JSON. This is correct for JSON primitives/arrays/objects, but will produce invalid JSON if the model emits an unquoted string identifier (e.g. unit=inches -> "unit":inches). Consider validating arg_value as a JSON literal (number/true/false/null/{/[) and otherwise quoting+escaping it so append_lfm2_call() always produces valid JSON.
Summary
- Fix `cactus/ffi/cactus_utils.h` so LFM2 Python-style tool calls round-trip to JSON correctly: drop a phantom empty-name tool when a schema has a property literally named `function`, bracket-match unquoted values so `interval=[1,3]` isn't cut to `[1`, and preserve the original quoted-ness so `radius=4` stays a number instead of becoming the string `"4"`.
- Strengthen the `LFM_IN_FUNC_NAME` state: `paren_open_tokens_` was populated only by `encode("(")`, missing BPE-merged tokens like `()` and ` (`. Added a reusable `add_tokens_for_prefix_string` helper that scans the vocab for every token whose decoded form starts with a given prefix, and used it to round out the set so no paren-carrying token can slip through while the name is still being emitted.
- Add `tests/test_bfcl_simple.cpp` (self-contained; only includes `cactus_ffi.h` and std headers). Embeds 20 simple_python cases, runs them through `cactus_complete`, and reports pass rate. Reads model path from `CACTUS_TEST_MODEL`.

Bench numbers
Test plan
- `CACTUS_TEST_MODEL=weights/lfm2.5-350m tests/build/test_bfcl_simple` → 15–16/20 pass
- `CACTUS_TEST_MODEL=weights/gemma-4-e2b-it tests/build/test_bfcl_simple` → 19/20 pass
- `test_gemma4_tools_bench` regressions: none (Gemma 4 numbers unchanged)