
Commit a520574

Add google embedder support (#192)
* feat: Add comprehensive test suite for DeepWiki project including unit, integration, and API tests
* feat: Add support for Google AI embeddings and update embedder configuration
* fix: Update authorization code parameter to be optional in delete_wiki_cache function
* feat: Refactor embedder handling to support multiple types and improve backward compatibility
1 parent 8aebd4b commit a520574

19 files changed: +1635 −61 lines

README.md

Lines changed: 97 additions & 3 deletions
@@ -26,6 +26,7 @@
 - **Ask Feature**: Chat with your repository using RAG-powered AI to get accurate answers
 - **DeepResearch**: Multi-turn research process that thoroughly investigates complex topics
 - **Multiple Model Providers**: Support for Google Gemini, OpenAI, OpenRouter, and local Ollama models
+- **Flexible Embeddings**: Choose between OpenAI, Google AI, or local Ollama embeddings for optimal performance

 ## 🚀 Quick Start (Super Easy!)

@@ -39,6 +40,8 @@ cd deepwiki-open
 # Create a .env file with your API keys
 echo "GOOGLE_API_KEY=your_google_api_key" > .env
 echo "OPENAI_API_KEY=your_openai_api_key" >> .env
+# Optional: Use Google AI embeddings instead of OpenAI (recommended if using Google models)
+echo "DEEPWIKI_EMBEDDER_TYPE=google" >> .env
 # Optional: Add OpenRouter API key if you want to use OpenRouter models
 echo "OPENROUTER_API_KEY=your_openrouter_api_key" >> .env
 # Optional: Add Ollama host if not local. defaults to http://localhost:11434
@@ -67,6 +70,8 @@ Create a `.env` file in the project root with these keys:
 ```
 GOOGLE_API_KEY=your_google_api_key
 OPENAI_API_KEY=your_openai_api_key
+# Optional: Use Google AI embeddings (recommended if using Google models)
+DEEPWIKI_EMBEDDER_TYPE=google
 # Optional: Add this if you want to use OpenRouter models
 OPENROUTER_API_KEY=your_openrouter_api_key
 # Optional: Add this if you want to use Azure OpenAI models
@@ -269,6 +274,89 @@ If you want to use embedding models compatible with the OpenAI API (such as Alib

 This allows you to seamlessly switch to any OpenAI-compatible embedding service without code changes.

+## 🧠 Using Google AI Embeddings
+
+DeepWiki now supports Google AI's latest embedding models as an alternative to OpenAI embeddings. This provides better integration when you're already using Google Gemini models for text generation.
+
+### Features
+
+- **Latest Model**: Uses Google's `text-embedding-004` model
+- **Same API Key**: Uses your existing `GOOGLE_API_KEY` (no additional setup required)
+- **Better Integration**: Optimized for use with Google Gemini text generation models
+- **Task-Specific**: Supports semantic similarity, retrieval, and classification tasks
+- **Batch Processing**: Efficient processing of multiple texts
+
+### How to Enable Google AI Embeddings
+
+**Option 1: Environment Variable (Recommended)**
+
+Set the embedder type in your `.env` file:
+
+```bash
+# Your existing Google API key
+GOOGLE_API_KEY=your_google_api_key
+
+# Enable Google AI embeddings
+DEEPWIKI_EMBEDDER_TYPE=google
+```
+
+**Option 2: Docker Environment**
+
+```bash
+docker run -p 8001:8001 -p 3000:3000 \
+  -e GOOGLE_API_KEY=your_google_api_key \
+  -e DEEPWIKI_EMBEDDER_TYPE=google \
+  -v ~/.adalflow:/root/.adalflow \
+  ghcr.io/asyncfuncai/deepwiki-open:latest
+```
+
+**Option 3: Docker Compose**
+
+Add to your `.env` file:
+
+```bash
+GOOGLE_API_KEY=your_google_api_key
+DEEPWIKI_EMBEDDER_TYPE=google
+```
+
+Then run:
+
+```bash
+docker-compose up
+```
+
+### Available Embedder Types
+
+| Type | Description | API Key Required | Notes |
+|------|-------------|------------------|-------|
+| `openai` | OpenAI embeddings (default) | `OPENAI_API_KEY` | Uses `text-embedding-3-small` model |
+| `google` | Google AI embeddings | `GOOGLE_API_KEY` | Uses `text-embedding-004` model |
+| `ollama` | Local Ollama embeddings | None | Requires local Ollama installation |
+
+### Why Use Google AI Embeddings?
+
+- **Consistency**: If you're using Google Gemini for text generation, using Google embeddings provides better semantic consistency
+- **Performance**: Google's latest embedding model offers excellent performance for retrieval tasks
+- **Cost**: Competitive pricing compared to OpenAI
+- **No Additional Setup**: Uses the same API key as your text generation models
+
+### Switching Between Embedders
+
+You can easily switch between different embedding providers:
+
+```bash
+# Use OpenAI embeddings (default)
+export DEEPWIKI_EMBEDDER_TYPE=openai
+
+# Use Google AI embeddings
+export DEEPWIKI_EMBEDDER_TYPE=google
+
+# Use local Ollama embeddings
+export DEEPWIKI_EMBEDDER_TYPE=ollama
+```
+
+**Note**: When switching embedders, you may need to regenerate your repository embeddings as different models produce different vector spaces.
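
Why regeneration matters: embeddings from different providers live in different vector spaces (and usually different dimensions), so an index built with one embedder cannot be queried with another. A minimal, hypothetical way to force a clean re-embed, assuming the local data sits under `~/.adalflow` (the directory mounted in the Docker example above):

```python
# Hypothetical helper: wipe the locally cached repository data so the next wiki
# generation re-embeds everything with the newly selected embedder.
# The path is an assumption based on the ~/.adalflow volume shown above.
import shutil
from pathlib import Path

DATA_DIR = Path.home() / ".adalflow"  # assumed DeepWiki data directory

def reset_embeddings() -> None:
    """Delete cached repository data; DeepWiki will rebuild it on the next run."""
    if DATA_DIR.exists():
        shutil.rmtree(DATA_DIR)

# reset_embeddings()  # run after changing DEEPWIKI_EMBEDDER_TYPE
```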
+
 ### Logging

 DeepWiki uses Python's built-in `logging` module for diagnostic output. You can configure the verbosity and log file destination via environment variables:
@@ -311,19 +399,25 @@ docker-compose up

 | Variable | Description | Required | Note |
 |----------------------|--------------------------------------------------------------|----------|----------------------------------------------------------------------------------------------------------|
-| `GOOGLE_API_KEY` | Google Gemini API key for AI generation | No | Required only if you want to use Google Gemini models
-| `OPENAI_API_KEY` | OpenAI API key for embeddings | Yes | Note: This is required even if you're not using OpenAI models, as it's used for embeddings. |
+| `GOOGLE_API_KEY` | Google Gemini API key for AI generation and embeddings | No | Required for Google Gemini models and Google AI embeddings
+| `OPENAI_API_KEY` | OpenAI API key for embeddings and models | Conditional | Required if using OpenAI embeddings or models |
 | `OPENROUTER_API_KEY` | OpenRouter API key for alternative models | No | Required only if you want to use OpenRouter models |
 | `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | No | Required only if you want to use Azure OpenAI models |
 | `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint | No | Required only if you want to use Azure OpenAI models |
 | `AZURE_OPENAI_VERSION` | Azure OpenAI version | No | Required only if you want to use Azure OpenAI models |
 | `OLLAMA_HOST` | Ollama Host (default: http://localhost:11434) | No | Required only if you want to use external Ollama server |
+| `DEEPWIKI_EMBEDDER_TYPE` | Embedder type: `openai`, `google`, or `ollama` (default: `openai`) | No | Controls which embedding provider to use |
 | `PORT` | Port for the API server (default: 8001) | No | If you host API and frontend on the same machine, make sure change port of `SERVER_BASE_URL` accordingly |
 | `SERVER_BASE_URL` | Base URL for the API server (default: http://localhost:8001) | No |
 | `DEEPWIKI_AUTH_MODE` | Set to `true` or `1` to enable authorization mode. | No | Defaults to `false`. If enabled, `DEEPWIKI_AUTH_CODE` is required. |
 | `DEEPWIKI_AUTH_CODE` | The secret code required for wiki generation when `DEEPWIKI_AUTH_MODE` is enabled. | No | Only used if `DEEPWIKI_AUTH_MODE` is `true` or `1`. |

-If you're not using ollama mode, you need to configure an OpenAI API key for embeddings. Other API keys are only required when configuring and using models from the corresponding providers.
+**API Key Requirements:**
+- If using `DEEPWIKI_EMBEDDER_TYPE=openai` (default): `OPENAI_API_KEY` is required
+- If using `DEEPWIKI_EMBEDDER_TYPE=google`: `GOOGLE_API_KEY` is required
+- If using `DEEPWIKI_EMBEDDER_TYPE=ollama`: No API key required (local processing)
+
+Other API keys are only required when configuring and using models from the corresponding providers.
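
The requirements above can also be checked up front; a small, hypothetical helper (not part of this commit) that fails fast when the matching key is missing:

```python
# Hypothetical startup check mirroring the key requirements listed above.
import os

REQUIRED_KEY = {"openai": "OPENAI_API_KEY", "google": "GOOGLE_API_KEY", "ollama": None}

def check_embedder_key() -> None:
    """Raise early if the API key required by DEEPWIKI_EMBEDDER_TYPE is not set."""
    embedder = os.environ.get("DEEPWIKI_EMBEDDER_TYPE", "openai").lower()
    key_name = REQUIRED_KEY.get(embedder)
    if key_name and not os.environ.get(key_name):
        raise RuntimeError(f"{key_name} is required when DEEPWIKI_EMBEDDER_TYPE={embedder}")
```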

 ## Authorization Mode

api/api.py

Lines changed: 1 addition & 1 deletion
@@ -519,7 +519,7 @@ async def delete_wiki_cache(

     if WIKI_AUTH_MODE:
         logger.info("check the authorization code")
-        if WIKI_AUTH_CODE != authorization_code:
+        if not authorization_code or WIKI_AUTH_CODE != authorization_code:
             raise HTTPException(status_code=401, detail="Authorization code is invalid")

     logger.info(f"Attempting to delete wiki cache for {owner}/{repo} ({repo_type}), lang: {language}")
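
With `authorization_code` now optional, the guard has to reject a missing code as well as a mismatched one. A minimal, hypothetical sketch of the resulting behavior:

```python
# Illustrative only: the guard rejects both a missing/empty code and a mismatch.
from typing import Optional

WIKI_AUTH_CODE = "secret"  # stands in for the configured DEEPWIKI_AUTH_CODE

def is_authorized(authorization_code: Optional[str]) -> bool:
    """Mirror the updated check: empty/None fails first, then any mismatch fails."""
    if not authorization_code or WIKI_AUTH_CODE != authorization_code:
        return False
    return True

assert is_authorized(None) is False      # parameter omitted (it is optional now)
assert is_authorized("wrong") is False   # wrong code
assert is_authorized("secret") is True   # matching code
```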

api/config.py

Lines changed: 49 additions & 4 deletions
@@ -10,6 +10,7 @@
 from api.openai_client import OpenAIClient
 from api.openrouter_client import OpenRouterClient
 from api.bedrock_client import BedrockClient
+from api.google_embedder_client import GoogleEmbedderClient
 from api.azureai_client import AzureAIClient
 from api.dashscope_client import DashscopeClient
 from adalflow import GoogleGenAIClient, OllamaClient
@@ -44,12 +45,16 @@
 WIKI_AUTH_MODE = raw_auth_mode.lower() in ['true', '1', 't']
 WIKI_AUTH_CODE = os.environ.get('DEEPWIKI_AUTH_CODE', '')

+# Embedder settings
+EMBEDDER_TYPE = os.environ.get('DEEPWIKI_EMBEDDER_TYPE', 'openai').lower()
+
 # Get configuration directory from environment variable, or use default if not set
 CONFIG_DIR = os.environ.get('DEEPWIKI_CONFIG_DIR', None)

 # Client class mapping
 CLIENT_CLASSES = {
     "GoogleGenAIClient": GoogleGenAIClient,
+    "GoogleEmbedderClient": GoogleEmbedderClient,
     "OpenAIClient": OpenAIClient,
     "OpenRouterClient": OpenRouterClient,
     "OllamaClient": OllamaClient,
@@ -144,7 +149,7 @@ def load_embedder_config():
     embedder_config = load_json_config("embedder.json")

     # Process client classes
-    for key in ["embedder", "embedder_ollama"]:
+    for key in ["embedder", "embedder_ollama", "embedder_google"]:
         if key in embedder_config and "client_class" in embedder_config[key]:
             class_name = embedder_config[key]["client_class"]
             if class_name in CLIENT_CLASSES:
@@ -154,12 +159,18 @@ def load_embedder_config():

 def get_embedder_config():
     """
-    Get the current embedder configuration.
+    Get the current embedder configuration based on DEEPWIKI_EMBEDDER_TYPE.

     Returns:
         dict: The embedder configuration with model_client resolved
     """
-    return configs.get("embedder", {})
+    embedder_type = EMBEDDER_TYPE
+    if embedder_type == 'google' and 'embedder_google' in configs:
+        return configs.get("embedder_google", {})
+    elif embedder_type == 'ollama' and 'embedder_ollama' in configs:
+        return configs.get("embedder_ollama", {})
+    else:
+        return configs.get("embedder", {})

 def is_ollama_embedder():
     """
@@ -181,6 +192,40 @@ def is_ollama_embedder():
     client_class = embedder_config.get("client_class", "")
     return client_class == "OllamaClient"

+def is_google_embedder():
+    """
+    Check if the current embedder configuration uses GoogleEmbedderClient.
+
+    Returns:
+        bool: True if using GoogleEmbedderClient, False otherwise
+    """
+    embedder_config = get_embedder_config()
+    if not embedder_config:
+        return False
+
+    # Check if model_client is GoogleEmbedderClient
+    model_client = embedder_config.get("model_client")
+    if model_client:
+        return model_client.__name__ == "GoogleEmbedderClient"
+
+    # Fallback: check client_class string
+    client_class = embedder_config.get("client_class", "")
+    return client_class == "GoogleEmbedderClient"
+
+def get_embedder_type():
+    """
+    Get the current embedder type based on configuration.
+
+    Returns:
+        str: 'ollama', 'google', or 'openai' (default)
+    """
+    if is_ollama_embedder():
+        return 'ollama'
+    elif is_google_embedder():
+        return 'google'
+    else:
+        return 'openai'
+
 # Load repository and file filters configuration
 def load_repo_config():
     return load_json_config("repo.json")
@@ -271,7 +316,7 @@ def load_lang_config():

 # Update embedder configuration
 if embedder_config:
-    for key in ["embedder", "embedder_ollama", "retriever", "text_splitter"]:
+    for key in ["embedder", "embedder_ollama", "embedder_google", "retriever", "text_splitter"]:
         if key in embedder_config:
             configs[key] = embedder_config[key]
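
A brief usage sketch of the new selection helpers, assuming `api.config` is importable from the project root; note that `EMBEDDER_TYPE` is read once at import time, so `DEEPWIKI_EMBEDDER_TYPE` must be set before the module is loaded:

```python
# Sketch only: exercise the embedder-selection helpers added above.
import os
os.environ["DEEPWIKI_EMBEDDER_TYPE"] = "google"  # must be set before importing api.config

from api.config import get_embedder_config, get_embedder_type, is_google_embedder

cfg = get_embedder_config()      # the "embedder_google" block when it is configured
print(is_google_embedder())      # True when the resolved client is GoogleEmbedderClient
print(get_embedder_type())       # "google"
print(cfg.get("model_kwargs"))   # e.g. {"model": "text-embedding-004", "task_type": "SEMANTIC_SIMILARITY"}
```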

api/config/embedder.json

Lines changed: 14 additions & 0 deletions
@@ -8,6 +8,20 @@
       "encoding_format": "float"
     }
   },
+  "embedder_ollama": {
+    "client_class": "OllamaClient",
+    "model_kwargs": {
+      "model": "nomic-embed-text"
+    }
+  },
+  "embedder_google": {
+    "client_class": "GoogleEmbedderClient",
+    "batch_size": 100,
+    "model_kwargs": {
+      "model": "text-embedding-004",
+      "task_type": "SEMANTIC_SIMILARITY"
+    }
+  },
   "retriever": {
     "top_k": 20
   },
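
For illustration, the `embedder_google` block could be turned into a working embedder roughly as follows. This is a sketch with assumptions: that the resolved client class takes no constructor arguments (reading `GOOGLE_API_KEY` from the environment) and that adalflow's `Embedder` wrapper is used:

```python
# Sketch only: build an embedder from the configuration resolved by api.config.
import adalflow as adal
from api.config import get_embedder_config

cfg = get_embedder_config()                    # "embedder_google" when DEEPWIKI_EMBEDDER_TYPE=google
embedder = adal.Embedder(
    model_client=cfg["model_client"](),        # assumed no-arg construction of GoogleEmbedderClient
    model_kwargs=cfg.get("model_kwargs", {}),  # {"model": "text-embedding-004", "task_type": "SEMANTIC_SIMILARITY"}
)
result = embedder(input=["def hello(): pass"])  # embed a small batch of strings
```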
