
examples : predicted output for text generation #14739

Open
wants to merge 1 commit into
base: master

Conversation

iamlemec
Collaborator

This adds an example that allows the user to specify predicted (expected) outputs that are then used as a draft to speed up generation. The most prominent use case would be making changes to an existing code block. OpenAI currently offers a similar feature called Predicted Outputs.

Obviously this has a lot of overlap with speculative decoding and lookup decoding. I think the main difference is that this gives the user more direct control over the expected output. It will also try to pick the draft back up after a divergence, once a few consecutive tokens match again, so in that sense it brings in some of the benefits of lookup decoding.

I added some example scripts for testing code modification. These can compare predicted output against speculative and lookup decoding. I also included a script that uses the Osmosis-Apply-1.7B model, which is directly targeted at code patching.
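
As a rough standalone illustration of the re-sync behaviour described above (a minimal sketch, not the code in this PR; the function name, the int32_t token type, and the 3-token match threshold are all illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

// Find a position in the draft from which drafting can resume: the draft is
// only trusted again once the last n_match generated tokens appear
// consecutively somewhere in it. Returns -1 if no such position exists.
static int64_t resync_draft_pos(const std::vector<int32_t> & generated,
                                const std::vector<int32_t> & draft,
                                size_t n_match = 3) {
    if (generated.size() < n_match || draft.size() < n_match) {
        return -1;
    }
    const auto key_begin = generated.end() - n_match;
    for (size_t i = 0; i + n_match <= draft.size(); ++i) {
        if (std::equal(key_begin, generated.end(), draft.begin() + i)) {
            return (int64_t)(i + n_match); // next draft position to propose from
        }
    }
    return -1;
}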

@CISC requested a review from Copilot on July 24, 2025 at 10:50

Copilot AI left a comment


Pull Request Overview

This PR adds a predicted output generation feature that allows users to provide expected/draft text to accelerate text generation. The implementation includes an indexing mechanism to find matches between generated tokens and the draft text, with recovery capabilities when predictions fail.

  • Implements a predicted output generation algorithm with draft text matching and recovery
  • Adds shell scripts for testing and comparing with speculative and lookup decoding approaches
  • Integrates the new predicted example into the existing build system and command-line argument parsing
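
The indexing mechanism in the first bullet can be pictured roughly as follows (a simplified, self-contained sketch; the actual implementation keys its draft_index_t on common_ngram with common_ngram_hash_function, whereas the hash, names, and n-gram length here are illustrative):

#include <cstdint>
#include <unordered_map>
#include <vector>

// map: hashed n-gram of draft tokens -> draft positions that follow that n-gram
using draft_index = std::unordered_map<uint64_t, std::vector<int64_t>>;

// FNV-1a over a window of token ids, used as a cheap n-gram key
static uint64_t hash_ngram(const int32_t * toks, size_t n) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; ++i) {
        h ^= (uint64_t)(uint32_t) toks[i];
        h *= 1099511628211ull;
    }
    return h;
}

// index every n-gram in the draft so that, given the last n generated tokens,
// the generator can look up where in the draft to continue proposing from
static draft_index build_draft_index(const std::vector<int32_t> & draft, size_t n = 3) {
    draft_index idx;
    for (size_t i = 0; i + n <= draft.size(); ++i) {
        idx[hash_ngram(draft.data() + i, n)].push_back((int64_t)(i + n));
    }
    return idx;
}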

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:
  • examples/predicted/predicted.cpp: Core implementation of predicted output generation with draft text indexing and matching logic
  • examples/predicted/CMakeLists.txt: Build configuration for the predicted example executable
  • examples/predicted/*.sh: Shell scripts for testing predicted output against speculative and lookup decoding
  • examples/predicted/README.md: Documentation explaining the predicted output algorithm
  • examples/predicted/data/patch_code.txt: Sample test data for the predicted output feature
  • examples/CMakeLists.txt: Integration of the predicted example into the main build system
  • common/common.h: Addition of the LLAMA_EXAMPLE_PREDICTED enum and the draft text parameter
  • common/arg.cpp: Command-line argument parsing support for draft text and example integration
Comments suppressed due to low confidence (1)

examples/predicted/predicted.cpp:15

  • [nitpick] The typedef name 'draft_index_t' uses a '_t' suffix which is typically reserved for POSIX types. Consider using 'DraftIndex' or 'draft_index' instead.
typedef std::unordered_map<common_ngram, std::vector<int64_t>, common_ngram_hash_function> draft_index_t;
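
If the rename were adopted, it might look something like this (just a sketch of the suggestion, reusing the same common_ngram and common_ngram_hash_function types the current typedef depends on):

// same structure, without the POSIX-style '_t' suffix
using draft_index = std::unordered_map<common_ngram, std::vector<int64_t>, common_ngram_hash_function>;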


// eval the prompt
const int n_prompt = inp.size();
llama_decode(ctx, llama_batch_get_one(inp.data(), n_prompt - 1));

Copilot AI Jul 24, 2025


[nitpick] The expression 'n_prompt - 1' appears multiple times in the code. Consider storing this value in a named variable like 'n_prompt_tokens' for better readability and maintainability.

Suggested change
llama_decode(ctx, llama_batch_get_one(inp.data(), n_prompt - 1));
const int n_prompt_tokens = n_prompt - 1;
llama_decode(ctx, llama_batch_get_one(inp.data(), n_prompt_tokens));


// process the accepted tokens and update contexts
const std::string token_str = common_token_to_piece(ctx, id_last);
if (params.use_color && use_draft) {
LOG("\u001b[%dm%s\u001b[37m", (36 - 0 % 6), token_str.c_str());

Copilot AI Jul 24, 2025


The expression '(36 - 0 % 6)' evaluates to a constant value of 36. This appears to be a magic number that should be defined as a named constant or explained with a comment.

Suggested change
LOG("\u001b[%dm%s\u001b[37m", (36 - 0 % 6), token_str.c_str());
LOG("\u001b[%dm%s\u001b[37m", CYAN_COLOR_CODE, token_str.c_str());
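
If that suggestion were taken, the constant would also need a definition somewhere; a minimal sketch (the name CYAN_COLOR_CODE comes from the suggestion above and is not defined in the PR):

// ANSI SGR foreground color code for cyan, used here when printing tokens while the draft is in use
constexpr int CYAN_COLOR_CODE = 36;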


const int draft_max = params.speculative.n_max;

// draft text to use for prediction
std::string draft_text = params.speculative.text;

Copilot AI Jul 24, 2025


The boolean parameters 'false, true' passed to common_tokenize are unclear. Consider adding comments to explain what these parameters control (add_special and parse_special).

Suggested change
std::string draft_text = params.speculative.text;
std::string draft_text = params.speculative.text;
// Tokenize the draft text:
// - `false` indicates that special tokens should not be added.
// - `true` indicates that special tokens should be parsed if present.
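
For context, the call those comments describe would look roughly like this (a sketch based on common_tokenize's add_special and parse_special parameters; the variable name draft_inp is illustrative):

std::vector<llama_token> draft_inp = common_tokenize(ctx, draft_text,
        /*add_special  =*/ false,  // do not prepend BOS/other special tokens to the draft
        /*parse_special=*/ true);  // do parse special tokens that appear literally in the text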

