🔥 MoGemma

Python/Mojo interface for Google Gemma 3.

Features

  • Embeddings — Dense vector embeddings via a pure Mojo backend.
  • Text generation — Synchronous and async streaming with configurable sampling.
  • Google Cloud Storage — Automatic model download from Google's gemma-data bucket.
  • OpenTelemetry — Optional tracing instrumentation.

Installation

pip install mogemma

For text generation (requires a tokenizer):

pip install 'mogemma[llm]'

Quick Start

Text Generation

from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))

Async Streaming

import asyncio
from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())

Embeddings

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.

from mogemma import EmbeddingModel

model = EmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
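A common downstream use of dense embeddings is similarity search. The sketch below shows cosine similarity between two vectors in pure Python; the tiny 4-dimensional vectors are stand-ins for the (2, 768) rows returned above, not mogemma output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real 768-dimensional embeddings.
v1 = [1.0, 0.0, 2.0, 0.0]
v2 = [1.0, 0.0, 2.0, 0.0]
v3 = [0.0, 1.0, 0.0, 1.0]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors -> 0.0
```

With real embeddings you would compute the same quantity row-by-row over the returned array.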

Selecting a Model Variant

All model classes default to gemma3-270m-it. Pass a model ID to use a different variant:

model = SyncGemmaModel("gemma3-1b-it")

For full control over sampling parameters, pass a GenerationConfig:

from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="gemma3-1b-it", temperature=0.7)
model = SyncGemmaModel(config)
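To make the temperature parameter concrete: temperature divides the model's logits before softmax, so values below 1.0 sharpen the next-token distribution and values above 1.0 flatten it. This is a generic illustration of the mechanism, not mogemma's internal sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by a sampling temperature.

    Lower temperature sharpens the distribution (more deterministic output);
    higher temperature flattens it (more varied output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.7)
flat = softmax_with_temperature(logits, 1.5)
# The top token's probability grows as temperature drops.
print(sharp[0] > flat[0])  # True
```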

Device Selection

GenerationConfig and EmbeddingConfig accept:

  • device="cpu"
  • device="gpu"
  • device="gpu:0" (or other index)

Device handling is deterministic:

  • device="cpu" always runs on CPU
  • explicit GPU requests never silently fall back to CPU
  • unavailable GPU requests raise an explicit error
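The rules above can be sketched as a small device-string parser that raises instead of silently falling back. This is a hypothetical helper for illustration only, not mogemma's actual internal API.

```python
def parse_device(device, gpu_count=0):
    """Parse a device string into (kind, index).

    Raises on unknown or unavailable devices rather than silently
    falling back to CPU.
    """
    if device == "cpu":
        return ("cpu", None)
    if device == "gpu" or device.startswith("gpu:"):
        index = 0 if device == "gpu" else int(device.split(":", 1)[1])
        if index >= gpu_count:
            raise RuntimeError(
                f"requested {device!r} but only {gpu_count} GPU(s) available"
            )
        return ("gpu", index)
    raise ValueError(f"unknown device string: {device!r}")

print(parse_device("cpu"))                 # ('cpu', None)
print(parse_device("gpu:1", gpu_count=2))  # ('gpu', 1)
```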

Current runtime status:

  • cpu and gpu are executable backends today
  • gpu / gpu:N execute via a mathematically verified runtime polyfill
CPU selection works the same way for both generation and embeddings:

from mogemma import EmbeddingConfig, EmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)

embeddings = EmbeddingModel(
    EmbeddingConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)

Explicit GPU requests are validated strictly:

from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(
    model_path="gemma3-1b-it",
    device="gpu:0",
)
model = SyncGemmaModel(config)

License

MIT
