enable deep <think> mode for local ollama #286

Open
zcuder wants to merge 1 commit into main from deep_think_ollama

Conversation

@zcuder (Contributor) commented on Jul 19, 2025

Enable think mode for local Ollama qwen3 models to improve response quality.

@gemini-code-assist bot left a comment

Summary of Changes

Hello @zcuder, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a 'deep think' mode specifically for local Ollama models. It modifies how prompts are constructed to encourage Ollama to generate internal reasoning, while simultaneously ensuring that these internal thoughts are not exposed to the end-user by filtering them out from the streamed responses.

Highlights

  • Ollama Integration: The system now conditionally sets the /think directive in the initial and simplified prompts when the request.provider is 'ollama'. This enables Ollama models to generate internal thought processes.
  • Response Filtering: Implemented new logic to filter out content enclosed within <think> and </think> tags from Ollama's streaming responses. This ensures that the model's internal reasoning or 'deep think' output is not displayed to the end-user, maintaining a clean and concise user experience.
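
Taken together, the two highlights above boil down to (1) picking the prompt directive per provider and (2) hiding everything between <think> and </think> from the user. As a minimal, non-streaming sketch of the idea (not the PR's actual code, which filters a streamed response chunk by chunk, as shown in the review below), with build_prompt and strip_think_blocks as illustrative names:

import re

def build_prompt(provider: str, system_prompt: str) -> str:
    # Prepend the Ollama-only /think directive; other providers get /no_think.
    directive = "/think" if provider == "ollama" else "/no_think"
    return f"{directive} {system_prompt}\n\n"

def strip_think_blocks(text: str) -> str:
    # Drop the model's internal reasoning so only the final answer is shown.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
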
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature                 Command               Description
Code Review             /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary    /gemini summary       Provides a summary of the current pull request in its current state.
Comment                 @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                    /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code-review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@zcuder force-pushed the deep_think_ollama branch from 9dad57e to 530a2b6 on July 19, 2025 02:53
@gemini-code-assist bot left a comment

Code Review

This pull request enables a <think> mode for the ollama provider. The review identified opportunities to improve code maintainability by reducing duplication, correcting prompt construction, and ensuring robust handling of streaming responses. A critical bug was found in the fallback logic that needs immediate attention.

@@ -639,11 +652,16 @@ async def handle_websocket_chat(websocket: WebSocket):
     fallback_response = await model.acall(api_kwargs=fallback_api_kwargs, model_type=ModelType.LLM)

     # Handle streaming fallback_response from Ollama
-    async for chunk in fallback_response:
+    think = False
+    async for chunk in response:

critical

There is a critical bug in this fallback logic. The code iterates over response instead of fallback_response. The fallback_response is the new response that should be streamed, while response is the original one that caused the token limit error. This will likely cause incorrect behavior or an infinite loop.

Additionally, the streaming logic here is a duplicate of the one at lines 533-544. Please consider refactoring the streaming logic into a single helper function to be used in both places.

async for chunk in fallback_response:
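
If the chunk filtering is also pulled into a shared helper as suggested, the corrected fallback block could shrink to a couple of lines. This is a sketch only; stream_ollama_response is a hypothetical helper name (one possible shape is sketched after the robust streaming implementation further down):

# Hypothetical: with the streaming/filtering logic extracted into a shared
# coroutine, the fallback path only needs to pass the correct response object.
fallback_response = await model.acall(api_kwargs=fallback_api_kwargs, model_type=ModelType.LLM)
await stream_ollama_response(fallback_response, websocket)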

@@ -428,7 +431,7 @@ async def handle_websocket_chat(websocket: WebSocket):
     model_config = get_model_config(request.provider, request.model)["model_kwargs"]

     if request.provider == "ollama":
-        prompt += " /no_think"
+        prompt += " /think"

high

The prompt is being appended with /think here, but it's already added at the beginning of the prompt on line 407. This results in a prompt for ollama that both starts and ends with a think command, which is likely not the intended behavior and could confuse the model. A similar issue exists for the simplified_prompt on line 642. Remove this line.

Comment on lines +533 to +544
+    think = False
     async for chunk in response:
         text = getattr(chunk, 'response', None) or getattr(chunk, 'text', None) or str(chunk)
         if text and not text.startswith('model=') and not text.startswith('created_at='):
-            text = text.replace('<think>', '').replace('</think>', '')
-            await websocket.send_text(text)
+            if text == '<think>':
+                think = True
+                logger.info("think enabled")
+            elif text == '</think>':
+                think = False
+            # skip <think>.....</think> in output
+            if not think:
+                await websocket.send_text(text)

high

The current logic for handling <think> and </think> tags is brittle because it assumes that these tags will always arrive in their own separate chunks. If a chunk contains a tag along with other text (e.g., </think>Here is the answer.), the logic will fail to process it correctly. This could lead to missing or incomplete model responses.

A more robust implementation using a state machine would handle various chunking scenarios correctly. This logic is also duplicated in the fallback handler (lines 655-664). Consider extracting it into a helper function.

in_think_block = False
async for chunk in response:
    text = getattr(chunk, 'response', None) or getattr(chunk, 'text', None) or str(chunk)
    if not (text and not text.startswith('model=') and not text.startswith('created_at=')):
        continue

    while text:
        if in_think_block:
            end_tag_pos = text.find('</think>')
            if end_tag_pos != -1:
                in_think_block = False
                text = text[end_tag_pos + len('</think>'):]
            else:
                # The rest of the chunk is inside the think block, so we discard it.
                text = ''
        else:  # Not in a think block
            start_tag_pos = text.find('<think>')
            if start_tag_pos != -1:
                # Send content before the <think> tag
                if start_tag_pos > 0:
                    await websocket.send_text(text[:start_tag_pos])
                in_think_block = True
                text = text[start_tag_pos + len('<think>'):]
            else:
                # No <think> tag in the chunk, send it all
                await websocket.send_text(text)
                text = ''
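
To address the duplication flagged both here and in the fallback-handler comment above, the state machine could live in one shared coroutine that both call sites use. A possible sketch only, assuming the handler already has the websocket in scope; the name stream_ollama_response is made up for illustration:

async def stream_ollama_response(response, websocket) -> None:
    # Hypothetical shared helper: stream chunks to the client while hiding
    # everything between <think> and </think>. Mirrors the state-machine
    # logic suggested above so both the main and fallback paths can reuse it.
    in_think_block = False
    async for chunk in response:
        text = getattr(chunk, 'response', None) or getattr(chunk, 'text', None) or str(chunk)
        # Skip empty chunks and Ollama metadata lines.
        if not text or text.startswith('model=') or text.startswith('created_at='):
            continue
        while text:
            if in_think_block:
                end_tag_pos = text.find('</think>')
                if end_tag_pos == -1:
                    text = ''  # rest of the chunk is internal reasoning; drop it
                else:
                    in_think_block = False
                    text = text[end_tag_pos + len('</think>'):]
            else:
                start_tag_pos = text.find('<think>')
                if start_tag_pos == -1:
                    await websocket.send_text(text)
                    text = ''
                else:
                    if start_tag_pos > 0:
                        await websocket.send_text(text[:start_tag_pos])
                    in_think_block = True
                    text = text[start_tag_pos + len('<think>'):]

Both call sites would then reduce to await stream_ollama_response(response, websocket) and await stream_ollama_response(fallback_response, websocket).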

Comment on lines +406 to +409
if request.provider == "ollama":
    prompt = f"/think {system_prompt}\n\n"
else:
    prompt = f"/no_think {system_prompt}\n\n"

medium

To improve maintainability and reduce code duplication, define the think/no_think command in a variable and reuse it. The same logic is repeated for simplified_prompt in the exception handler on lines 627-630.

think_mode_command = "/think" if request.provider == "ollama" else "/no_think"
prompt = f"{think_mode_command} {system_prompt}\n\n"
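
If adopted, the same variable can also cover the exception-handler path mentioned above (lines 627-630). A sketch only; simplified_system_prompt is a placeholder for whatever the fallback prompt is actually built from:

# Hypothetical reuse in the token-limit fallback handler.
simplified_prompt = f"{think_mode_command} {simplified_system_prompt}\n\n"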
