🔥 MoGemma

Python/Mojo interface for Google Gemma 3.

Features

  • Embeddings — Dense vector embeddings via a pure Mojo backend.
  • Text generation — Synchronous and async streaming with configurable sampling.
  • Google Cloud Storage — Automatic model download from Google's gemma-data bucket.
  • OpenTelemetry — Optional tracing instrumentation.

Installation

pip install mogemma

For text generation (requires a tokenizer):

pip install 'mogemma[llm]'

Quick Start

Text Generation

from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))

Async Streaming

import asyncio
from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())

Embeddings

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.

from mogemma import EmbeddingModel

model = EmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
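A common downstream use of dense embeddings is similarity search. The sketch below shows cosine similarity between two vectors in pure Python; the tiny 4-dimensional vectors are stand-ins for the (2, 768) rows returned above, not mogemma output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real 768-dimensional embeddings.
v1 = [1.0, 0.0, 2.0, 0.0]
v2 = [1.0, 0.0, 2.0, 0.0]
v3 = [0.0, 1.0, 0.0, 1.0]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors -> 0.0
```

With real embeddings you would compute the same quantity row-by-row over the returned array.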

Selecting a Model Variant

All model classes default to gemma3-270m-it. Pass a model ID to use a different variant:

model = SyncGemmaModel("gemma3-1b-it")

For full control over sampling parameters, pass a GenerationConfig:

from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="gemma3-1b-it", temperature=0.7)
model = SyncGemmaModel(config)
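To make the temperature parameter concrete: temperature divides the model's logits before softmax, so values below 1.0 sharpen the next-token distribution and values above 1.0 flatten it. This is a generic illustration of the mechanism, not mogemma's internal sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by a sampling temperature.

    Lower temperature sharpens the distribution (more deterministic output);
    higher temperature flattens it (more varied output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.7)
flat = softmax_with_temperature(logits, 1.5)
# The top token's probability grows as temperature drops.
print(sharp[0] > flat[0])  # True
```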

Device Selection

GenerationConfig and EmbeddingConfig accept:

  • device="cpu"
  • device="gpu"
  • device="gpu:0" (or other index)

Device handling is deterministic:

  • device="cpu" always runs on CPU
  • explicit GPU requests never silently fall back to CPU
  • unavailable GPU requests raise an explicit error
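The rules above can be sketched as a small device-string parser that raises instead of silently falling back. This is a hypothetical helper for illustration only, not mogemma's actual internal API.

```python
def parse_device(device, gpu_count=0):
    """Parse a device string into (kind, index).

    Raises on unknown or unavailable devices rather than silently
    falling back to CPU.
    """
    if device == "cpu":
        return ("cpu", None)
    if device == "gpu" or device.startswith("gpu:"):
        index = 0 if device == "gpu" else int(device.split(":", 1)[1])
        if index >= gpu_count:
            raise RuntimeError(
                f"requested {device!r} but only {gpu_count} GPU(s) available"
            )
        return ("gpu", index)
    raise ValueError(f"unknown device string: {device!r}")

print(parse_device("cpu"))                 # ('cpu', None)
print(parse_device("gpu:1", gpu_count=2))  # ('gpu', 1)
```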

Current runtime status:

  • cpu and gpu are executable backends today
  • gpu / gpu:N execute via a mathematically verified runtime polyfill
CPU selection works the same way for both generation and embeddings:

from mogemma import EmbeddingConfig, EmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)

embeddings = EmbeddingModel(
    EmbeddingConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)

Explicit GPU requests are validated strictly:

from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(
    model_path="gemma3-1b-it",
    device="gpu:0",
)
model = SyncGemmaModel(config)

License

MIT
