Python/Mojo interface for Google Gemma 3.
- Embeddings — Dense vector embeddings via a pure Mojo backend.
- Text generation — Synchronous and async streaming with configurable sampling.
- Google Cloud Storage — Automatic model download from Google's `gemma-data` bucket.
- OpenTelemetry — Optional tracing instrumentation.
```
pip install mogemma
```

For text generation (requires tokenizer):

```
pip install 'mogemma[llm]'
```

```python
from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))
```

```python
import asyncio
from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())
```

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.
```python
from mogemma import EmbeddingModel

model = EmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
```

All model classes default to `gemma3-270m-it`. Pass a model ID to use a different variant:

```python
model = SyncGemmaModel("gemma3-1b-it")
```

For full control over sampling parameters, pass a `GenerationConfig`:
```python
from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="gemma3-1b-it", temperature=0.7)
model = SyncGemmaModel(config)
```

`GenerationConfig` and `EmbeddingConfig` accept a `device` option:

- `device="cpu"`
- `device="gpu"`
- `device="gpu:0"` (or another index)
Device handling is deterministic:

- `device="cpu"` always runs on CPU
- explicit GPU requests never silently fall back to CPU
- unavailable GPU requests raise an explicit error
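The policy above can be sketched as a small resolution function. This is an illustrative sketch only: `resolve_device`, its signature, and the error types are hypothetical and are not part of mogemma's API.

```python
# Hypothetical sketch of the deterministic device policy described above;
# not mogemma's actual implementation.
def resolve_device(requested: str, gpu_available: bool) -> str:
    """Resolve a device string with no silent fallback."""
    if requested == "cpu":
        # "cpu" always runs on CPU, regardless of GPU availability.
        return "cpu"
    if requested == "gpu" or requested.startswith("gpu:"):
        if not gpu_available:
            # Explicit GPU requests never silently fall back to CPU;
            # an unavailable GPU raises an explicit error instead.
            raise RuntimeError(f"Requested {requested!r} but no GPU is available")
        return requested
    raise ValueError(f"Unknown device: {requested!r}")

print(resolve_device("cpu", gpu_available=False))   # cpu
print(resolve_device("gpu:0", gpu_available=True))  # gpu:0
```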
Current runtime status:

- `cpu` and `gpu` are executable backends today
- `gpu`/`gpu:N` execute via a mathematically verified runtime polyfill
```python
from mogemma import EmbeddingConfig, EmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)
embeddings = EmbeddingModel(
    EmbeddingConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)
```

Explicit GPU requests are validated strictly:
```python
from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(
    model_path="gemma3-1b-it",
    device="gpu:0",
)
model = SyncGemmaModel(config)
```

MIT