Self-hosted meeting transcription portal — turn long meeting recordings into usable deliverables on your own GPUs: corrected, speaker-attributed transcripts (SRT), structured summaries, quality reports and meeting-type-aware Word minutes. No cloud, no per-minute API bill, full data sovereignty.
🇫🇷 Interface and documentation currently in French (French README). The product is French-first today; UI strings and LLM prompts are centralized so localization is a planned evolution, not a rewrite.
Plenty of scripts wrap Whisper. TranscrIA is built as a service for teams that process real meetings, week after week:
- A real audio module, not an ffmpeg wrapper. Acoustic preflight (SNR, clipping, bandwidth, risk flags), speech/music/noise scene analysis, a per-window difficulty timeline shown to the user before transcription, optional Demucs source separation, loudness normalization, Silero VAD, all coordinated with GPU/VRAM management.
- Human-in-the-loop where it matters. Detected speakers come with playable audio excerpts, talk time and an acoustic gender hint; users validate names, participants and a domain lexicon before the final pass. Known-voice matching is consent-based (signed form, hashed proof, source audio deleted by default).
- LLM arbitration with guardrails. A local OpenAI-compatible LLM (e.g. llama.cpp) produces the structured summary, corrects the SRT using the validated lexicon and context, and a final review pass harmonizes the deliverables — with anti-hallucination cleanup, retry-then-fail-loud semantics, and prompts editable in the admin UI.
- Production-grade orchestration. Persistent GPU job queue (priorities, anti-starvation aging, pause/resume, scheduled starts), VRAM-aware admission per remaining pipeline phase, calendar-based GPU scheduling, a resumable pipeline (checkpoint/resume — a re-queued job never redoes finished work), and "waiting for VRAM" as a first-class, admin-alerted state instead of a silent failure.
- Three deployment topologies. All-in-one box; CPU-only web frontend + GPU worker (shared PostgreSQL, job files replicated through the database — no NFS to operate, sha256-verified integrity); and a remote inference node serving STT/diarization/voice-embedding over HTTP with VRAM autonomy (reuse → launch on demand → explicit 503).
- Compliance by design. Multi-user RBAC (roles, groups), full GDPR audit trail (actor, IP, timestamp, filterable, exportable), consent-gated voice profiles, secrets kept out of the versioned config.
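The "anti-starvation aging" mentioned in the orchestration bullet can be pictured with a small sketch. This is not TranscrIA's scheduler: the `Job` fields, the `AGING_PER_MINUTE` rate and the `pick_next` helper are hypothetical names chosen for illustration. The idea is only that a job's effective priority grows while it waits, so a long-queued low-priority job eventually beats a freshly submitted urgent one.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical aging rate: priority points gained per minute spent waiting.
AGING_PER_MINUTE = 0.5


@dataclass
class Job:
    name: str
    priority: int        # higher means more urgent
    enqueued_at: float   # seconds since an arbitrary epoch


def effective_priority(job: Job, now: float) -> float:
    """Base priority plus a bonus that grows with waiting time."""
    waited_minutes = max(0.0, now - job.enqueued_at) / 60.0
    return job.priority + AGING_PER_MINUTE * waited_minutes


def pick_next(queue: List[Job], now: float) -> Job:
    """Return the job with the highest effective priority (oldest wins ties)."""
    return max(queue, key=lambda j: (effective_priority(j, now), -j.enqueued_at))


# A low-priority job that has waited 2 minutes loses to a fresh urgent job...
early = [Job("weekly-sync", 1, enqueued_at=0.0), Job("board-meeting", 10, enqueued_at=120.0)]
print(pick_next(early, now=120.0).name)

# ...but after 30 minutes of waiting it is served first.
late = [Job("weekly-sync", 1, enqueued_at=0.0), Job("board-meeting", 10, enqueued_at=1800.0)]
print(pick_next(late, now=1800.0).name)
```

A linear bonus is the simplest possible aging rule; a real queue would likely also persist its state and cap the bonus.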
Home — jobs at a glance, one-click SRT / ZIP downloads
Processing profiles — pick your deliverable on a single slider right after upload; the portal pre-selects the most complete profile your hardware can run and hides the steps it doesn't need
Speaker validation — listen to excerpts, name speakers, acoustic gender hints
Configuration — detected hardware, friendly forms, LLM prompts editable in-app, full YAML for experts
GPU scheduling & queue — calendar windows (block night starts, cap concurrency), persistent queue with priorities
upload ─► audio diagnosis ─► quick summary (STT + LLM) ─► context, participants,
lexicon (human validation) ─► final pipeline:
preprocess → transcription → diarization → LLM correction → final review
→ quality scoring → exports (SRT, segments, quality report, DOCX minutes, ZIP)
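For readers unfamiliar with the SRT deliverable at the end of this pipeline, here is a minimal sketch of the standard SubRip layout: a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, then a blank line. The `Segment` type and the `Speaker: text` prefix are assumptions made for illustration; the source does not specify how TranscrIA encodes speaker attribution in its SRT files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a duration in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.speaker}: {seg.text}\n"   # speaker prefix is an assumed convention
        )
    return "\n".join(cues)


print(to_srt([
    Segment(0.0, 2.5, "Alice", "Let's get started."),
    Segment(2.5, 6.0, "Bob", "First item: the budget."),
]))
```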
- STT backends (interchangeable): Cohere transcribe (default), Whisper large-v3 / faster-whisper, IBM Granite Speech, NVIDIA Parakeet TDT (experimental) — served locally or by a remote OpenAI-compatible server (vLLM, SGLang…).
- Diarization backends: pyannote.audio (default) or NVIDIA Sortformer via NeMo.
- Word minutes adapted to 18 meeting types (works council, executive committee, project review, crisis…): LLM-extracted decisions/actions/votes, type-specific fields and visual themes, graceful degradation if extraction fails.
- Processing profiles (after upload): pick a deliverable on a single slider, from a quick "SRT express" to a full "dossier qualité", instead of an opaque fast/quality switch. The portal greys out profiles your hardware can't run, pre-selects the most complete one that fits, and then only executes the pipeline phases (and only reserves the GPU/LLM) that the chosen profile actually needs.
- Every phase is checkpointed: a re-dispatched job resumes at the first incomplete phase, even on a different worker.
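The checkpoint/resume behaviour described in the last bullet can be sketched as follows. The phase names come from the pipeline overview above; the JSON checkpoint file and the `run_pipeline` / `load_done` helpers are hypothetical, and a local file stands in for whatever shared storage lets a different worker see the same state.

```python
import json
from pathlib import Path
from typing import Callable, List

# Phase names from the pipeline overview; the checkpoint layout is an assumption.
PHASES = ["preprocess", "transcription", "diarization",
          "llm_correction", "final_review", "quality_scoring", "exports"]


def load_done(checkpoint: Path) -> List[str]:
    """Return the phases already recorded as finished (empty if no checkpoint)."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())
    return []


def run_pipeline(checkpoint: Path, run_phase: Callable[[str], None]) -> List[str]:
    """Run phases in order, skipping finished ones; return the phases run now."""
    done = load_done(checkpoint)
    executed = []
    for phase in PHASES:
        if phase in done:
            continue                      # finished in an earlier attempt
        run_phase(phase)                  # may raise; the checkpoint is then left untouched
        done.append(phase)
        checkpoint.write_text(json.dumps(done))   # persist after every phase
        executed.append(phase)
    return executed
```

If `run_phase` raises during diarization, the checkpoint still lists the two earlier phases, so the next call starts at diarization and never repeats transcription.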
git clone https://github.com/Martossien/transcria.git
cd transcria
./install.sh # venv, dependencies, CUDA-matched PyTorch, config.yaml, optional systemd unit

Arbitration LLM, auto-selected by VRAM. During install, TranscrIA detects your GPUs and recommends the largest tier that actually fits (12 / 16 / 24 / 32 / 48 / 64 GB), judged by real per-card placement (mono or split) rather than by total VRAM, and offers to download the right GGUF (with your HF token) and activate it: one prompt, no manual model-picking. Below 12 GB it falls back to raw transcription (no correction/summary LLM). The per-tier models are benchmarked in docs/BENCH_LLM_PALIERS.md; switch anytime with scripts/switch_arbitrage_llm.sh <tier>.
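To make the "per-card, not total VRAM" point concrete, here is a deliberately simplified sketch. It only models single-card ("mono") placement; how the installer evaluates split placement across cards is not described here, so it is left out. The `recommend_tier` function is a hypothetical name, not part of the installer.

```python
from typing import List, Optional

# Tier sizes in GB, as listed in the text above.
TIERS_GB = [12, 16, 24, 32, 48, 64]


def recommend_tier(per_gpu_vram_gb: List[float]) -> Optional[int]:
    """Return the largest tier that fits on a single card, or None below 12 GB.

    Simplification: only mono placement is modelled; split placement is ignored.
    """
    if not per_gpu_vram_gb:
        return None
    largest_card = max(per_gpu_vram_gb)
    fitting = [tier for tier in TIERS_GB if tier <= largest_card]
    return max(fitting) if fitting else None


print(recommend_tier([24.0, 24.0]))   # two 24 GB cards: judged per card, not as 48 GB
print(recommend_tier([8.0]))          # below 12 GB: no arbitration LLM tier
```

In this mono-only model two 24 GB cards yield tier 24, whereas a naive sum would suggest 48; the real installer may still reach a larger tier through split placement.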
Still bring your own STT weights and pyannote cache (see docs/INSTALL.md), fill in config.yaml, then validate the install with the built-in preflight — no GPU needed, no side effects:
venv/bin/python scripts/doctor.py # config, DB schema, LLM server, opencode, nodes, storage
venv/bin/python scripts/doctor.py --strict # warnings become failures (for deployment gates)

Start the service (./start.sh or systemd) and open the web UI. For distributed setups (web frontend + GPU worker, remote inference node), see docs/INSTALL.md §11–13 and docs/STOCKAGE_PARTAGE_JOBS.md.
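The "warnings become failures" semantics of the strict preflight can be illustrated with a short sketch. It does not reproduce scripts/doctor.py: the `Status` enum, the `exit_code` helper and the 0/1 exit codes are assumptions; only the check names are taken from the command comment above.

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    FAILURE = "failure"


def exit_code(results: Dict[str, Status], strict: bool = False) -> int:
    """Return 0 if the preflight passes, 1 otherwise (assumed convention)."""
    for status in results.values():
        if status is Status.FAILURE:
            return 1
        if strict and status is Status.WARNING:
            return 1          # strict mode promotes warnings to failures
    return 0


# Hypothetical results for the checks named in the comment above.
results = {
    "config": Status.OK,
    "DB schema": Status.OK,
    "LLM server": Status.WARNING,
    "storage": Status.OK,
}
print(exit_code(results))               # lenient run
print(exit_code(results, strict=True))  # deployment gate
```

A deployment gate would typically refuse to proceed on any non-zero result, which is why the strict variant suits CI and rollout scripts.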
Prefer containers? A turnkey script takes you from clone to a running stack — host GPU setup, secret/config generation, image build, docker compose up, health check — with no manual steps:
scripts/docker_quickstart.sh --bundled # EASIEST: try it — models baked in, no token, no download
scripts/docker_quickstart.sh # all-in-one GPU → http://localhost:7870
HF_TOKEN=hf_xxx scripts/docker_quickstart.sh # reference quality (gated Cohere STT + pyannote); omit for the no-token path
scripts/docker_quickstart.sh --cpu # no GPU (web + scheduler)
scripts/docker_quickstart.sh --down # stop

Default login: open http://localhost:7870 and sign in with admin / CHANGE-ME (the initial credentials in the generated config.yaml, key auth.first_admin_password). Change the password before any real use: it's a placeholder, and a warning is logged while it stays at its default.
scripts/docker_quickstart.sh --bundled is the friendliest way to evaluate the project: it
pulls (or builds) the :bundled image with the default models already baked in — so there's
no Hugging Face token, no download, and it even works offline. You only need an NVIDIA GPU
(compute capability ≥ 7.5 — RTX 20xx or newer — and ≥ 12 GB VRAM) with Docker GPU access;
the script checks this up front and stops with a clear message if your card is too small.
⚠️ This is a quick-test image, not the complete project. To keep it token-free it uses the entry-level engines: transcription via Whisper, diarization via NVIDIA Sortformer (≤ 4 speakers, experimental), and the smallest 9B arbitration LLM. It exercises the full 6-profile workflow (summary / correction / review all run), but not the reference quality. For that (Cohere STT + pyannote with unlimited speakers, and larger LLM tiers), provide a free HF_TOKEN (after accepting both models' conditions) or set TRANSCRIA_LLM_TIER. Nothing to reconfigure: the same command picks them up.
The all-in-one GPU image (CUDA 12.6) bundles the whole pipeline: STT, diarization and the arbitration LLM (a compiled llama-server serving a small non-gated GGUF). The default :latest (slim) image downloads those models at first run; :bundled bakes them in (zero download, no host-cache pitfalls; see the slim-vs-bundled table in the docs).

With no token at all, the full 6-profile workflow runs (speaker labels via NVIDIA Sortformer, ≤ 4 speakers); a free HF token (plus accepting both model conditions) switches to reference quality (Cohere + pyannote, unlimited speakers).

The slim image bakes no weights and is publishable: point the quickstart at a published image (TRANSCRIA_ALLINONE_IMAGE=ghcr.io/<owner>/transcria-allinone:vX) to pull instead of build. The quickstart script is idempotent (it never overwrites an existing config.yaml/.env) and validated end-to-end on GPU.

GPU requirement: NVIDIA compute capability ≥ 7.5 (Turing or newer: RTX 20xx→50xx, A/L/H-series; Blackwell via PTX JIT) and ≥ 12 GB VRAM (the default 9B LLM peaks at ~10.6 GB; phases are sequenced, not additive). The full reference (image, compose, GPU/VRAM compatibility table, variables, publishing, rollback) is in docs/DOCKER.md.
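The GPU requirement and the "sequenced, not additive" remark can be read as a small piece of arithmetic, sketched below. The two thresholds and the ~10.6 GB LLM peak come from the text; the STT and diarization figures are made-up placeholders, and both function names are hypothetical.

```python
from typing import Mapping, Tuple

MIN_COMPUTE_CAPABILITY = (7, 5)   # Turing or newer, per the text
MIN_VRAM_GB = 12.0

# Peak VRAM per phase in GB. Only the LLM figure (~10.6) is from the text;
# the other two values are illustrative placeholders, not measurements.
PHASE_PEAK_GB = {"stt": 6.0, "diarization": 3.0, "llm": 10.6}


def gpu_is_supported(compute_capability: Tuple[int, int], vram_gb: float) -> bool:
    """Check a card against the two stated minimums."""
    return compute_capability >= MIN_COMPUTE_CAPABILITY and vram_gb >= MIN_VRAM_GB


def peak_vram_gb(phase_peaks: Mapping[str, float], sequenced: bool = True) -> float:
    """Sequenced phases need the largest single peak; concurrent ones need the sum."""
    peaks = phase_peaks.values()
    return max(peaks) if sequenced else sum(peaks)


print(gpu_is_supported((7, 5), 12.0))
print(peak_vram_gb(PHASE_PEAK_GB))                    # sequenced: the largest phase
print(peak_vram_gb(PHASE_PEAK_GB, sequenced=False))   # what an additive model would need
```

Because phases run one after another, the requirement is driven by the single largest phase, which is how a 12 GB card can host a pipeline whose phases would not fit side by side.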
| Layer | Technology |
|---|---|
| Backend | Python 3.11+, Flask 3, SQLAlchemy + Alembic (PostgreSQL in production, SQLite for dev) |
| STT serving | vLLM / SGLang / any OpenAI-compatible server; local engines |
| Diarization & voice | pyannote.audio, NVIDIA NeMo (Sortformer), local voice embeddings |
| LLM phases | opencode driving a local OpenAI-compatible LLM — selectable backend: Ollama / llama.cpp / vLLM, chosen per hardware from a data-driven profile catalog (docs/LLM_BACKENDS.md) |
| Audio | ffmpeg/ffprobe, Demucs, Silero VAD, SQUIM / DNSMOS quality metrics |
| Frontend | Server-rendered Jinja2 + Bootstrap 5, vanilla JS |
| Exports | python-docx (themed minutes), SRT, JSON, ZIP package |
v0.1.0-beta.7. New in beta.7: the arbitration LLM is now multi-backend (Ollama / llama.cpp / vLLM), chosen automatically from the physical hardware via a single data-driven profile catalog (transcria/data/llm_profiles.yaml, no model size hardcoded; VRAM footprint derived from the real model and re-measured on first load). Ollama becomes the "easy" backend (curl | sh, no compilation, no nvcc, no HF token); llama.cpp gains a prebuilt CUDA binary path (ai-dock, sha256-verified; installs on a blank distro without compiling). See docs/LLM_BACKENDS.md.

The product is functional and covered by 2,983 tests (green CI: ruff, mypy, full pytest on PostgreSQL, ~80 % coverage). The installer is validated end-to-end on Ubuntu 22.04/24.04, Debian 12, Fedora 41, Rocky 9 × Python 3.11–3.13 (apt + dnf), with the full pipeline (STT + diarization + LLM), across mono- and multi-GPU setups (Ollama 12B/35B, llama.cpp 35B-A3B, vLLM 27B-FP8); see docs/LLM_PROFILS_VALIDATION.md. The distributed topology (CPU frontend + GPU resource node) is validated end-to-end on real audio with a vLLM arbitration LLM (tensor-parallel) and automatic VRAM placement across 8 GPUs; see docs/DOCKER.md and docs/PLAN_TEST_SPLIT_VLLM.md. Concurrency is hardened under load (robust up to 8 concurrent jobs; see docs/PLAN_TEST_CHARGE.md).

Following SemVer, the 0.x series is a stabilization phase: the API, the configuration schema and the data model may still change without backward-compatibility guarantees until 1.0.0. Evaluate it, pilot it, but don't bet production on it without your own validation.

A containerized deployment (Dockerfile, compose, GPU support, turnkey quickstart) is available; see docs/DOCKER.md.
Language: the UI and the LLM prompts are French-first (the pipeline is tuned for French meetings). Both are centralized/editable, so adding languages is a planned evolution, not a rewrite.
Full documentation lives in docs/ (currently in French):
| Document | Content |
|---|---|
| docs/INSTALL.md | Installation, models, systemd, troubleshooting, distributed deployment |
| docs/DOCKER.md | Containerized deployment — turnkey quickstart, image, compose, GPU (CDI), variables, rollback |
| docs/TECHNICAL.md | Architecture, pipeline, API, GPU orchestration |
| docs/CONFIG_REFERENCE.md | Complete config.yaml reference |
| docs/DATA_MODEL.md | DB schema, job states, files per job |
| docs/SERVICE_RESSOURCES_GPU.md | Remote inference, VRAM autonomy, degraded modes |
| docs/STOCKAGE_PARTAGE_JOBS.md | PostgreSQL-backed job file store for split deployments |
| CONTRIBUTING.md · SECURITY.md · CHANGELOG.md | Contributing, security policy, changelog |
TranscrIA is released under the Apache License 2.0. Third-party components (bundled libraries and binaries, and runtime-downloaded models) and their licenses / attributions are listed in THIRD_PARTY_NOTICES.md — including the CC-BY-4.0 attribution for the DNSMOS/SQUIM quality models, and the licenses of components shipped in the Docker images (opencode — MIT, ffmpeg — GPL/LGPL via Debian, etc.). No GPL/AGPL (strong copyleft) dependency is present at runtime.






