This looks like a description of "speculative decoding". There are a couple of llama.cpp examples implementing it: https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative and https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative-simple. It's not currently supported at all in the high-level executors. It's probably possible to implement it using the BatchedExecutor (I sketched out a prototype a while ago but never quite got it working). It should definitely be possible to implement it using the low-level/native API, since we directly expose all the llama.cpp calls.
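To illustrate what speculative (assisted) decoding does, here is a minimal, language-agnostic sketch in Python. It is not LLamaSharp or llama.cpp code: the "models" are hypothetical toy functions that return a greedy next token, and only the greedy variant is shown (the sampling variant needs a rejection-sampling acceptance step instead of an equality check). A cheap draft model proposes a few tokens, the expensive target model checks them, and the longest agreeing prefix is kept plus one token from the target, so the result matches plain greedy decoding with the target alone.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: given the tokens so far, return its
# greedy (argmax) next token. A real model would return logits instead.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one target call per new token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position. In a real engine these
        #    predictions would come from one batched forward pass, which is
        #    where the speedup comes from; here they are computed one by one.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(tokens + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # 3. The target adds one token: the correction at the first mismatch,
        #    or a bonus token if every drafted token was accepted.
        if len(tokens) < goal:
            tokens.append(target(tokens))
    return tokens


# Hypothetical toy models for demonstration only.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after token 3, forcing some rejections.
    return 5 if seq[-1] == 3 else (seq[-1] * 3 + 1) % 10


if __name__ == "__main__":
    spec = speculative_decode(toy_target, toy_draft, [1], 8, k=3)
    plain = greedy_decode(toy_target, [1], 8)
    print(spec == plain)
```

The design point is that drafted tokens are only ever kept when the target would have produced them anyway, so quality is unchanged; the gain depends on how often the draft agrees and on verifying all drafted positions in a single batch, which is why something like the BatchedExecutor or the native API would be the natural place to build it.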
-
Is it possible to do something similar in LLamaSharp now?
https://huggingface.co/blog/assisted-generation