36 changes: 33 additions & 3 deletions README.md
@@ -17,12 +17,12 @@
[badge-zenodo]: https://zenodo.org/badge/899554552.svg


🧬 CellAnnotator is an [scverse ecosystem package](https://scverse.org/packages/#ecosystem), designed to annotate cell types in scRNA-seq data based on marker genes using large language models (LLMs). It supports OpenAI, Google Gemini, and Anthropic Claude models out of the box, with more providers planned for the future.
🧬 CellAnnotator is an [scverse ecosystem package](https://scverse.org/packages/#ecosystem), designed to annotate cell types in scRNA-seq data based on marker genes using large language models (LLMs). It supports OpenAI, Google Gemini, Anthropic Claude, and OpenRouter models out of the box.


## ✨ Key Features

- 🤖 **LLM-agnostic backend**: Seamlessly use models from OpenAI, Anthropic (Claude), and Gemini (Google) — just set your provider and API key.
- 🤖 **LLM-agnostic backend**: Seamlessly use models from OpenAI, Anthropic (Claude), Gemini (Google), or OpenRouter — just set your provider and API key.
- 🧬 **Automatically annotate cells** including type, state, and confidence fields.
- 🔄 **Consistent annotations** across all samples in your study.
- 🧠 **Infuse prior knowledge** by providing information about your biological system.
@@ -60,6 +60,7 @@ After installation, head over to the LLM provider of your choice to generate an
- OpenAI: [API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)
- Google (Gemini): [API key](https://ai.google.dev/gemini-api/docs/api-key)
- Anthropic (Claude): [API key](https://docs.anthropic.com/en/docs/get-started)
- OpenRouter: [API key](https://openrouter.ai/settings/keys)


🔒 Keep this key private and don't share it with anyone. `CellAnnotator` will try to read the key from an environment variable: either expose it to the environment yourself, or store it in a `.env` file anywhere within the repository where you conduct your analysis and plan to run `CellAnnotator`. The package will then use [dotenv](https://pypi.org/project/python-dotenv/) to export the key from the `.env` file as an environment variable.
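
A minimal sketch of the "expose it to the environment yourself" path, using only the standard library. The helper name `read_api_key` is hypothetical and is not part of `CellAnnotator`; the variable name `OPENROUTER_API_KEY` matches the one this PR registers for OpenRouter. It only shows how a script might fail early, with a readable message, when the key is missing.

```python
import os


def read_api_key(env_var: str = "OPENROUTER_API_KEY") -> str:
    """Return the key stored in `env_var`, or raise a readable error if it is unset."""
    # Hypothetical helper for illustration; CellAnnotator has its own key handling.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell or add it to a .env file."
        )
    return key
```

Calling `read_api_key()` before constructing the annotator would surface a missing key up front rather than at the first request.
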
@@ -78,6 +79,31 @@ cell_ann = CellAnnotator(

By default, this will store annotations in `adata.obs['cell_type_predicted']`. Head over to our 📚 [tutorials](https://cell-annotator.readthedocs.io/en/latest/notebooks/tutorials/index.html) to see more advanced use cases, and learn how to adapt this to your own data. You can run `CellAnnotator` for just a single sample of data, or across multiple samples. In the latter case, it will attempt to harmonize annotations across samples.

### Advanced provider options

`CellAnnotator` can also be used in single-sample mode by setting `sample_key=None`.

Example:

```python
from cell_annotator import CellAnnotator

cell_ann = CellAnnotator(
    adata=adata,
    species="human",
    tissue="pancreas",
    cluster_key="leiden_1",
    sample_key=None,  # single-sample mode
    provider="openrouter",
    model="openai/gpt-4o-mini",
    api_key="YOUR_OPENROUTER_API_KEY",
)

cell_ann.get_expected_cell_type_markers(n_markers=3)
cell_ann.get_cluster_markers()
cell_ann.annotate_clusters(key_added="cell_type_predicted")
```



## 💸 Costs and models
@@ -89,15 +115,19 @@ CellAnnotator is LLM-agnostic and works with multiple providers:

- **Anthropic Claude:** Claude models are supported. See the [Anthropic pricing page](https://docs.anthropic.com/claude/docs/pricing) for details.

- **OpenRouter:** OpenRouter routes requests to many model families (including OpenAI, Anthropic, and others) behind a single API key. Use `provider="openrouter"` and pass a model slug such as `openai/gpt-4o-mini` or `anthropic/claude-3.5-sonnet`.

You can select your provider and model by setting the appropriate parameters. More providers may be supported in the future as the LLM ecosystem evolves.



## 🔐 Data privacy
This package sends cluster marker genes, and the `species` and `tissue` you define, to the selected LLM provider (e.g., OpenAI, Google, or Anthropic). **No actual gene expression values are sent.**
This package sends cluster marker genes, and the `species` and `tissue` you define, to the selected LLM provider (e.g., OpenAI, Google, Anthropic, or OpenRouter and the upstream provider it routes to). **No actual gene expression values are sent.**

Please ensure your usage of this package aligns with your institution's guidelines on data privacy and the use of external AI models. Each provider has its own privacy policy and terms of service. Review these carefully before using CellAnnotator with sensitive or regulated data.

When using OpenRouter, requests are forwarded to the upstream provider implied by your model slug (e.g. `openai/...`, `anthropic/...`). Review both [OpenRouter's privacy policy](https://openrouter.ai/privacy) and the upstream provider's. Some OpenRouter model tiers may log prompts by default; users who need privacy guarantees should configure this via their OpenRouter account settings.
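
As a rough illustration of "the upstream provider implied by your model slug", the prefix before the first slash can be read off directly. The function name `upstream_provider` is hypothetical, and the sketch assumes slugs follow the `<provider>/<model>` shape described above; which backend actually serves a request is decided by OpenRouter, not by this parsing.

```python
def upstream_provider(model_slug: str) -> str:
    """Return the provider prefix of an OpenRouter-style '<provider>/<model>' slug."""
    # Hypothetical helper: only inspects the string, makes no network calls.
    prefix, separator, model_name = model_slug.partition("/")
    if not separator or not prefix or not model_name:
        raise ValueError(f"Not a '<provider>/<model>' slug: {model_slug!r}")
    return prefix.lower()


for slug in ("openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"):
    print(slug, "->", upstream_provider(slug))
```

This can serve as a reminder of whose privacy policy to read in addition to OpenRouter's.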


## 🙏 Credits
This tool was inspired by [Hou et al., Nature Methods 2024](https://www.nature.com/articles/s41592-024-02235-4) and [https://github.com/VPetukhov/GPTCellAnnotator](https://github.com/VPetukhov/GPTCellAnnotator).
3 changes: 2 additions & 1 deletion src/cell_annotator/_constants.py
@@ -19,9 +19,10 @@ class PackageConstants:
        "openai": "gpt-4o-mini",
        "gemini": "gemini-2.5-flash-lite",
        "anthropic": "claude-haiku-4-5",
        "openrouter": "openai/gpt-4o-mini",
    }
    # Supported LLM providers
    supported_providers: list[str] = ["openai", "gemini", "anthropic"]
    supported_providers: list[str] = ["openai", "gemini", "anthropic", "openrouter"]
    default_cluster_key: str = "leiden"
    cell_type_key: str = "cell_type_harmonized"
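
A sketch of how constants like these might be consumed when a caller names a provider but no model. The values are copied from the hunk above; the names `DEFAULT_MODELS`, `SUPPORTED_PROVIDERS`, and `resolve_model` are illustrative assumptions (the hunk does not show the dictionary's attribute name), not the package's actual API.

```python
from typing import Optional

# Values taken from the diff; the variable names are hypothetical stand-ins.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "gemini": "gemini-2.5-flash-lite",
    "anthropic": "claude-haiku-4-5",
    "openrouter": "openai/gpt-4o-mini",
}
SUPPORTED_PROVIDERS = ["openai", "gemini", "anthropic", "openrouter"]


def resolve_model(provider: str, model: Optional[str] = None) -> str:
    """Return `model` if given, else the provider's default; reject unknown providers."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported provider {provider!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    return model or DEFAULT_MODELS[provider]
```

Note that the OpenRouter default is itself a slug (`openai/gpt-4o-mini`), so the default stays consistent with the slash-based detection added elsewhere in this PR.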

69 changes: 49 additions & 20 deletions src/cell_annotator/model/_api_keys.py
@@ -15,22 +15,32 @@ class APIKeyManager:
    on setup for different providers.
    """

    # Provider configurations
    # Provider configurations. ``model_keywords`` feed ``detect_provider_from_model``;
    # OpenRouter is detected via the ``provider/model`` slash heuristic instead.
    PROVIDER_CONFIG = {
        "openai": {
            "env_var": "OPENAI_API_KEY",
            "setup_url": "https://platform.openai.com/api-keys",
            "description": "OpenAI models (GPT, o1, etc.)",
            "model_keywords": ("gpt", "o1", "davinci", "curie", "babbage", "ada"),
        },
        "gemini": {
            "env_var": "GEMINI_API_KEY",
            "setup_url": "https://aistudio.google.com/apikey",
            "description": "Google Gemini models",
            "model_keywords": ("gemini", "bison"),
        },
        "anthropic": {
            "env_var": "ANTHROPIC_API_KEY",
            "setup_url": "https://console.anthropic.com/settings/keys",
            "description": "Anthropic Claude models",
            "model_keywords": ("claude", "anthropic", "sonnet", "haiku", "opus"),
        },
        "openrouter": {
            "env_var": "OPENROUTER_API_KEY",
            "setup_url": "https://openrouter.ai/settings/keys",
            "description": "OpenRouter models (aggregated providers)",
            "model_keywords": (),
        },
    }
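
To make the role of the `env_var` entries concrete, here is a small standalone sketch that reports which providers have a key present. The mapping repeats the variable names from `PROVIDER_CONFIG`; the names `ENV_VARS` and `configured_providers` are hypothetical, and `APIKeyManager` may perform this check differently.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Environment variable names as listed in PROVIDER_CONFIG above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def configured_providers(environ: Mapping[str, str] | None = None) -> list[str]:
    """List providers whose API key variable is set to a non-empty value."""
    # Accepting a mapping makes the function easy to exercise without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    return [name for name, var in ENV_VARS.items() if env.get(var)]
```

For example, `configured_providers({"OPENROUTER_API_KEY": "sk-..."})` would report only `openrouter`.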

@@ -168,8 +178,6 @@ def validate_model_access(self, model: str) -> tuple[bool, str | None]:
        """
        Check if a specific model is accessible by detecting its provider.

        Uses heuristics to detect provider from model name.

        Parameters
        ----------
        model
@@ -179,23 +187,8 @@ def validate_model_access(self, model: str) -> tuple[bool, str | None]:
        -------
        Tuple of (is_accessible, provider_name)
        """
        # Detect provider from model name using heuristics
        model_lower = model.lower()

        if any(gemini_name in model_lower for gemini_name in ["gemini", "bison"]):
            provider = "gemini"
        elif any(claude_name in model_lower for claude_name in ["claude", "anthropic"]):
            provider = "anthropic"
        elif any(openai_name in model_lower for openai_name in ["gpt", "o1", "davinci", "curie", "babbage", "ada"]):
            provider = "openai"
        else:
            # Default to OpenAI for unknown models (most common)
            provider = "openai"

        if self.validate_provider(provider):
            return True, provider
        else:
            return False, provider
        provider = detect_provider_from_model(model)
        return self.validate_provider(provider), provider

    def check_and_warn(self, provider: str | None = None, model: str | None = None) -> bool:
        """
@@ -249,6 +242,42 @@ def check_and_warn(self, provider: str | None = None, model: str | None = None)
        return False


def detect_provider_from_model(model: str) -> str:
    """
    Auto-detect the LLM provider from a model name string.

    OpenRouter slugs follow ``<provider>/<model>`` (e.g. ``openai/gpt-4o-mini``);
    the ``models/`` prefix that Gemini IDs sometimes carry is excluded so it
    does not false-match. Otherwise, match keywords from
    ``APIKeyManager.PROVIDER_CONFIG[*].model_keywords`` in priority order
    (gemini, anthropic, openai). Defaults to ``"openai"`` if nothing matches.

    Parameters
    ----------
    model
        Model name or slug.

    Returns
    -------
    Provider name.
    """
    model_lower = model.lower()

    # OpenRouter uses '<provider>/<model>' slugs (e.g. 'openai/gpt-4o-mini').
    # The 'models/' guard avoids false-matching Gemini IDs like 'models/gemini-1.5-flash'.
    if "/" in model and not model_lower.startswith("models/"):
        return "openrouter"

    # Priority order matters: a model name like "ada-claude-experiment" should
    # route to anthropic, not openai (anthropic-specific keywords win).
    for provider in ("gemini", "anthropic", "openai"):
        keywords = APIKeyManager.PROVIDER_CONFIG[provider].get("model_keywords", ())
        if any(keyword in model_lower for keyword in keywords):
            return provider

    return "openai"
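
The detection order can be checked in isolation with a standalone copy of the logic. This is a sketch that mirrors the function above rather than importing it: `MODEL_KEYWORDS` and `detect_provider` are local stand-ins, with keyword tuples copied from `PROVIDER_CONFIG`, and the model names in the loop are made-up inputs chosen to hit each branch.

```python
# Keyword tuples copied from PROVIDER_CONFIG; dict order encodes the priority
# (gemini, anthropic, openai) because Python dicts preserve insertion order.
MODEL_KEYWORDS = {
    "gemini": ("gemini", "bison"),
    "anthropic": ("claude", "anthropic", "sonnet", "haiku", "opus"),
    "openai": ("gpt", "o1", "davinci", "curie", "babbage", "ada"),
}


def detect_provider(model: str) -> str:
    """Standalone mirror of the slash heuristic plus keyword matching."""
    model_lower = model.lower()
    # Any slash marks an OpenRouter slug, except Gemini's 'models/' prefix.
    if "/" in model_lower and not model_lower.startswith("models/"):
        return "openrouter"
    for provider, keywords in MODEL_KEYWORDS.items():
        if any(keyword in model_lower for keyword in keywords):
            return provider
    return "openai"  # fallback for names that match nothing


for name in (
    "openai/gpt-4o-mini",       # slash slug
    "models/gemini-1.5-flash",  # guarded Gemini ID
    "ada-claude-experiment",    # anthropic keyword wins over 'ada'
    "mistral-large",            # no keyword, falls back
):
    print(f"{name!r} -> {detect_provider(name)}")
```

One consequence of the slash rule worth keeping in mind: any name containing `/` other than a `models/...` ID is treated as an OpenRouter slug, even if its suffix also contains a provider keyword.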


class APIKeyMixin:
    """Mixin class to add API key management capabilities to other classes."""
