Not able to inference deepseek-coder-6.7b-instruct.Q5_K_M.gguf #14593

@Antriksh29071989

Description

System Info

OS: macOS (Apple M1 Max)


Name: langchain
Version: 0.0.349
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author:
Author-email:
License: MIT
Requires: aiohttp, async-timeout, dataclasses-json, jsonpatch, langchain-community, langchain-core, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by:

Who can help?

@hwchase17 @agola11

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Steps to reproduce:

I have followed the instructions provided here: https://python.langchain.com/docs/integrations/llms/llamacpp, but I am not able to run inference correctly.

Model path : https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain, QAGenerationChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
n_gpu_layers = 1  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q2_K.gguf",
    n_gpu_layers=n_gpu_layers,
    max_tokens=2000,
    top_p=1,
    n_batch=n_batch,
    callback_manager=callback_manager,
    f16_kv=True,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm(
    "Question: Write python program to add two numbers ? Answer:"
) 

Result: < """"""""""""""""""""""/"
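
For context, a run of garbled characters like this is often what an instruct-tuned model produces when it receives a raw prompt without its expected chat template. A sketch of a workaround I would try, assuming the `### Instruction:` / `### Response:` template from the DeepSeek Coder model card (verify against the exact template for this GGUF build) and assuming the default context window of 512 is too small:

```python
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

# Assumption: DeepSeek Coder instruct expects this template (per the model
# card on Hugging Face); the exact wording may differ for this GGUF build.
deepseek_template = (
    "You are an AI programming assistant.\n"
    "### Instruction:\n"
    "{question}\n"
    "### Response:\n"
)
prompt = PromptTemplate(template=deepseek_template, input_variables=["question"])

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    n_ctx=4096,  # assumption: the default context window (512) is likely too small
    n_gpu_layers=1,
    n_batch=512,
    max_tokens=2000,
)

print(llm(prompt.format(question="Write a Python program to add two numbers.")))
```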

Requesting you to look into it.
Please let me know in case you need more information.
Thank you.

I have tried the same model file with llama-cpp-python package and it works as expected.
Please find below the code that I have tried:

import json
import time
from llama_cpp import Llama
n_gpu_layers = 1  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512
llm = Llama(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    chat_format="llama-2",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
)

start_time = time.time()
pp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a Python language assistant."},
        {"role": "user", "content": "Write quick sort."},
    ]
)

end_time = time.time()
print("execution time:", {end_time - start_time})
print(pp["choices"][0]["message"]["content"])

Output :

## Quick Sort Algorithm in Python
Here is a simple implementation of the quicksort algorithm in Python:

```python
def partition(arr, low, high):
    i = (low-1)         # index of smaller element
    pivot = arr[high]     # pivot

    for j in range(low , high):
        if   arr[j] <= pivot:
            i += 1
            arr[i],arr[j] = arr[j],arr[i]

    arr[i+1],arr[high] = arr[high],arr[i+1]
    return (i+1)

def quickSort(arr, low, high):
    if low < high:
        pi = partition(arr,low,high)
        quickSort(arr, low, pi-1)
        quickSort(arr, pi+1, high)

# Test the code
n = int(input("Enter number of elements in array: "))
print("Enter elements: ")
arr = [int(input()) for _ in range(n)]
quickSort(arr,0,n-1)
print ("Sorted array is:")
for i in range(n):
    print("%d" %arr[i]),
This code first defines a helper function `partition()` that takes an array and two indices. It then rearranges the elements of the array so that all numbers less than or equal to the pivot are on its left, while all numbers greater than the pivot are on its right. The `quickSort()` function is then defined which recursively applies this partitioning process until the entire array is sorted.

The user can input their own list of integers and the program will output a sorted version of that list.
[/code]

Conclusion
In conclusion, Python provides several built-in functions for sorting lists such as `sort()` or `sorted()` but it's also possible to implement quick sort algorithm from scratch using custom function. This can be useful in situations where you need more control over the sorting process or when dealing with complex data structures.
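
One difference that may matter here: `chat_format="llama-2"` makes llama-cpp-python wrap the messages in a chat template before tokenizing, while LangChain's `LlamaCpp` passes the string through as a raw completion prompt. A sketch of an A/B check using llama-cpp-python's raw completion call (reusing the same `llm` object as above):

```python
# Call the same Llama object with a raw, untemplated string, which mirrors
# what LangChain's LlamaCpp does internally. If this also produces garbled
# output, the problem is the missing prompt template rather than LangChain.
raw = llm("Question: Write python program to add two numbers ? Answer:", max_tokens=200)
print(raw["choices"][0]["text"])
```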

Expected behavior

It should run inference with the model just like the native llama-cpp-python package does.
