enable deep <think> mode for local ollama #286

Open
zcuder wants to merge 1 commit into main from deep_think_ollama

Conversation

@zcuder (Contributor) commented on Jul 19, 2025

Enable think mode for local Ollama qwen3 models to improve response quality.

@gemini-code-assist bot left a comment

Summary of Changes

Hello @zcuder, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a 'deep think' mode specifically for local Ollama models. It modifies how prompts are constructed to encourage Ollama to generate internal reasoning, while simultaneously ensuring that these internal thoughts are not exposed to the end-user by filtering them out from the streamed responses.

Highlights

  • Ollama Integration: The system now conditionally sets the /think directive in the initial and simplified prompts when the request.provider is 'ollama'. This enables Ollama models to generate internal thought processes.
  • Response Filtering: Implemented new logic to filter out content enclosed within <think> and </think> tags from Ollama's streaming responses. This ensures that the model's internal reasoning or 'deep think' output is not displayed to the end-user, maintaining a clean and concise user experience.
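
Taken together, the two highlights above boil down to (1) picking the prompt directive per provider and (2) hiding everything between <think> and </think> from the user. As a minimal, non-streaming sketch of the idea (not the PR's actual code, which filters a streamed response chunk by chunk, as shown in the review below), with build_prompt and strip_think_blocks as illustrative names:

import re

def build_prompt(provider: str, system_prompt: str) -> str:
    # Prepend the Ollama-only /think directive; other providers get /no_think.
    directive = "/think" if provider == "ollama" else "/no_think"
    return f"{directive} {system_prompt}\n\n"

def strip_think_blocks(text: str) -> str:
    # Drop the model's internal reasoning so only the final answer is shown.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
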
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature                 Command               Description
Code Review             /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary    /gemini summary       Provides a summary of the current pull request in its current state.
Comment                 @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                    /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code-review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@zcuder force-pushed the deep_think_ollama branch from 9dad57e to 530a2b6 on July 19, 2025 02:53
@gemini-code-assist bot left a comment

Code Review

This pull request enables a <think> mode for the ollama provider. The review identified opportunities to improve code maintainability by reducing duplication, correcting prompt construction, and ensuring robust handling of streaming responses. A critical bug was found in the fallback logic that needs immediate attention.

@@ -639,11 +652,16 @@ async def handle_websocket_chat(websocket: WebSocket):
     fallback_response = await model.acall(api_kwargs=fallback_api_kwargs, model_type=ModelType.LLM)

     # Handle streaming fallback_response from Ollama
-    async for chunk in fallback_response:
+    think = False
+    async for chunk in response:

critical

There is a critical bug in this fallback logic. The code iterates over response instead of fallback_response. The fallback_response is the new response that should be streamed, while response is the original one that caused the token limit error. This will likely cause incorrect behavior or an infinite loop.

Additionally, the streaming logic here is a duplicate of the one at lines 533-544. Please consider refactoring the streaming logic into a single helper function to be used in both places.

async for chunk in fallback_response:
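
If the chunk filtering is also pulled into a shared helper as suggested, the corrected fallback block could shrink to a couple of lines. This is a sketch only; stream_ollama_response is a hypothetical helper name (one possible shape is sketched after the robust streaming implementation further down):

# Hypothetical: with the streaming/filtering logic extracted into a shared
# coroutine, the fallback path only needs to pass the correct response object.
fallback_response = await model.acall(api_kwargs=fallback_api_kwargs, model_type=ModelType.LLM)
await stream_ollama_response(fallback_response, websocket)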

@@ -428,7 +431,7 @@ async def handle_websocket_chat(websocket: WebSocket):
     model_config = get_model_config(request.provider, request.model)["model_kwargs"]

     if request.provider == "ollama":
-        prompt += " /no_think"
+        prompt += " /think"

high

The prompt is being appended with /think here, but it's already added at the beginning of the prompt on line 407. This results in a prompt for ollama that both starts and ends with a think command, which is likely not the intended behavior and could confuse the model. A similar issue exists for the simplified_prompt on line 642. Remove this line.

Comment on lines +533 to +544
+    think = False
     async for chunk in response:
         text = getattr(chunk, 'response', None) or getattr(chunk, 'text', None) or str(chunk)
         if text and not text.startswith('model=') and not text.startswith('created_at='):
-            text = text.replace('<think>', '').replace('</think>', '')
-            await websocket.send_text(text)
+            if text == '<think>':
+                think = True
+                logger.info("think enabled")
+            elif text == '</think>':
+                think = False
+            # skip <think>.....</think> in output
+            if not think:
+                await websocket.send_text(text)

high

The current logic for handling <think> and </think> tags is brittle because it assumes that these tags will always arrive in their own separate chunks. If a chunk contains a tag along with other text (e.g., </think>Here is the answer.), the logic will fail to process it correctly. This could lead to missing or incomplete model responses.

A more robust implementation using a state machine would handle various chunking scenarios correctly. This logic is also duplicated in the fallback handler (lines 655-664). Consider extracting it into a helper function.

in_think_block = False
async for chunk in response:
    text = getattr(chunk, 'response', None) or getattr(chunk, 'text', None) or str(chunk)
    if not (text and not text.startswith('model=') and not text.startswith('created_at=')):
        continue

    while text:
        if in_think_block:
            end_tag_pos = text.find('</think>')
            if end_tag_pos != -1:
                in_think_block = False
                text = text[end_tag_pos + len('</think>'):]
            else:
                # The rest of the chunk is inside the think block, so we discard it.
                text = ''
        else:  # Not in a think block
            start_tag_pos = text.find('<think>')
            if start_tag_pos != -1:
                # Send content before the <think> tag
                if start_tag_pos > 0:
                    await websocket.send_text(text[:start_tag_pos])
                in_think_block = True
                text = text[start_tag_pos + len('<think>'):]
            else:
                # No <think> tag in the chunk, send it all
                await websocket.send_text(text)
                text = ''
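
To address the duplication flagged both here and in the fallback-handler comment above, the state machine could live in one shared coroutine that both call sites use. A possible sketch only, assuming the handler already has the websocket in scope; the name stream_ollama_response is made up for illustration:

async def stream_ollama_response(response, websocket) -> None:
    # Hypothetical shared helper: stream chunks to the client while hiding
    # everything between <think> and </think>. Mirrors the state-machine
    # logic suggested above so both the main and fallback paths can reuse it.
    in_think_block = False
    async for chunk in response:
        text = getattr(chunk, 'response', None) or getattr(chunk, 'text', None) or str(chunk)
        # Skip empty chunks and Ollama metadata lines.
        if not text or text.startswith('model=') or text.startswith('created_at='):
            continue
        while text:
            if in_think_block:
                end_tag_pos = text.find('</think>')
                if end_tag_pos == -1:
                    text = ''  # rest of the chunk is internal reasoning; drop it
                else:
                    in_think_block = False
                    text = text[end_tag_pos + len('</think>'):]
            else:
                start_tag_pos = text.find('<think>')
                if start_tag_pos == -1:
                    await websocket.send_text(text)
                    text = ''
                else:
                    if start_tag_pos > 0:
                        await websocket.send_text(text[:start_tag_pos])
                    in_think_block = True
                    text = text[start_tag_pos + len('<think>'):]

Both call sites would then reduce to await stream_ollama_response(response, websocket) and await stream_ollama_response(fallback_response, websocket).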

Comment on lines +406 to +409
if request.provider == "ollama":
    prompt = f"/think {system_prompt}\n\n"
else:
    prompt = f"/no_think {system_prompt}\n\n"

medium

To improve maintainability and reduce code duplication, define the think/no_think command in a variable and reuse it. The same logic is repeated for simplified_prompt in the exception handler on lines 627-630.

think_mode_command = "/think" if request.provider == "ollama" else "/no_think"
prompt = f"{think_mode_command} {system_prompt}\n\n"
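
If adopted, the same variable can also cover the exception-handler path mentioned above (lines 627-630). A sketch only; simplified_system_prompt is a placeholder for whatever the fallback prompt is actually built from:

# Hypothetical reuse in the token-limit fallback handler.
simplified_prompt = f"{think_mode_command} {simplified_system_prompt}\n\n"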
