Not able to inference deepseek-coder-6.7b-instruct.Q5_K_M.gguf #14593

@Antriksh29071989

Description

System Info

OS: macOS (Apple M1 Max)


Name: langchain
Version: 0.0.349
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author:
Author-email:
License: MIT
Requires: aiohttp, async-timeout, dataclasses-json, jsonpatch, langchain-community, langchain-core, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by:

Who can help?

@hwchase17 @agola11

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Steps to reproduce:

I have followed the instructions provided here: https://python.langchain.com/docs/integrations/llms/llamacpp, but I am not able to run inference correctly.

Model path : https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain, QAGenerationChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
n_gpu_layers = 1  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q2_K.gguf",
    n_gpu_layers=n_gpu_layers,
    max_tokens=2000,
    top_p=1,
    n_batch=n_batch,
    callback_manager=callback_manager,
    f16_kv=True,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm(
    "Question: Write python program to add two numbers ? Answer:"
) 

Result: < """"""""""""""""""""""/"
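
For context, a run of garbled characters like this is often what an instruct-tuned model produces when it receives a raw prompt without its expected chat template. A sketch of a workaround I would try, assuming the `### Instruction:` / `### Response:` template from the DeepSeek Coder model card (verify against the exact template for this GGUF build) and assuming the default context window of 512 is too small:

```python
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

# Assumption: DeepSeek Coder instruct expects this template (per the model
# card on Hugging Face); the exact wording may differ for this GGUF build.
deepseek_template = (
    "You are an AI programming assistant.\n"
    "### Instruction:\n"
    "{question}\n"
    "### Response:\n"
)
prompt = PromptTemplate(template=deepseek_template, input_variables=["question"])

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    n_ctx=4096,  # assumption: the default context window (512) is likely too small
    n_gpu_layers=1,
    n_batch=512,
    max_tokens=2000,
)

print(llm(prompt.format(question="Write a Python program to add two numbers.")))
```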

Requesting you to look into it.
Please let me know in case you need more information.
Thank you.

I have tried the same model file with llama-cpp-python package and it works as expected.
Please find below the code that I have tried:

import json
import time
from llama_cpp import Llama
n_gpu_layers = 1  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512
llm = Llama(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    chat_format="llama-2",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
)

start_time = time.time()
pp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a Python language assistant."},
        {"role": "user", "content": "Write quick sort."},
    ]
)

end_time = time.time()
print("execution time:", {end_time - start_time})
print(pp["choices"][0]["message"]["content"])

Output :

## Quick Sort Algorithm in Python
Here is a simple implementation of the quicksort algorithm in Python:

```python
def partition(arr, low, high):
    i = (low-1)         # index of smaller element
    pivot = arr[high]     # pivot

    for j in range(low , high):
        if   arr[j] <= pivot:
            i += 1
            arr[i],arr[j] = arr[j],arr[i]

    arr[i+1],arr[high] = arr[high],arr[i+1]
    return (i+1)

def quickSort(arr, low, high):
    if low < high:
        pi = partition(arr,low,high)
        quickSort(arr, low, pi-1)
        quickSort(arr, pi+1, high)

# Test the code
n = int(input("Enter number of elements in array: "))
print("Enter elements: ")
arr = [int(input()) for _ in range(n)]
quickSort(arr,0,n-1)
print ("Sorted array is:")
for i in range(n):
    print("%d" %arr[i]),
This code first defines a helper function `partition()` that takes an array and two indices. It then rearranges the elements of the array so that all numbers less than or equal to the pivot are on its left, while all numbers greater than the pivot are on its right. The `quickSort()` function is then defined which recursively applies this partitioning process until the entire array is sorted.

The user can input their own list of integers and the program will output a sorted version of that list.
[/code]

Conclusion
In conclusion, Python provides several built-in functions for sorting lists such as `sort()` or `sorted()` but it's also possible to implement quick sort algorithm from scratch using custom function. This can be useful in situations where you need more control over the sorting process or when dealing with complex data structures.
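
One difference that may matter here: `chat_format="llama-2"` makes llama-cpp-python wrap the messages in a chat template before tokenizing, while LangChain's `LlamaCpp` passes the string through as a raw completion prompt. A sketch of an A/B check using llama-cpp-python's raw completion call (reusing the same `llm` object as above):

```python
# Call the same Llama object with a raw, untemplated string, which mirrors
# what LangChain's LlamaCpp does internally. If this also produces garbled
# output, the problem is the missing prompt template rather than LangChain.
raw = llm("Question: Write python program to add two numbers ? Answer:", max_tokens=200)
print(raw["choices"][0]["text"])
```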

Expected behavior

It should run inference with the model just like the native llama-cpp-python package does.
