Create IBM Granite Speech Models MCP Server #907

@crivetimihai

Description

Overview

Create a comprehensive MCP server for IBM Granite speech models that supports speech recognition, speech translation, speech synthesis, and audio processing, with multi-language coverage and enterprise-grade features.

Server Specifications

Server Details

  • Name: granite-speech-server
  • Language: Python 3.11+
  • Location: mcp-servers/python/granite_speech_server/
  • Purpose: Provide access to IBM Granite speech models via MCP

Supported Models

From IBM Granite Speech on Hugging Face:

  • Speech Recognition: granite-speech-asr-en, granite-whisper-multilingual
  • Translation: granite-speech-translate-en-es, granite-speech-translate-en-fr
  • Synthesis: granite-speech-tts-v1, granite-voice-cloning
  • Processing: granite-speech-enhancement, granite-audio-classification

Language Support

English to seven target languages: Spanish, French, German, Italian, Portuguese, Japanese, and Chinese

Provider Support

  • Ollama: Local inference with speech models
  • watsonx.ai: IBM's enterprise speech services
  • Hugging Face: Direct access to speech transformers
  • Custom Endpoints: Flexible audio API integration
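Keeping the four backends interchangeable suggests a small provider registry behind a common interface. A minimal sketch of that pattern (class and function names here are illustrative, and the `transcribe` body is a stub rather than a real client call):

```python
class SpeechProvider:
    """Minimal provider interface; real subclasses would wrap the
    Ollama, watsonx.ai, or Hugging Face speech clients."""

    def transcribe(self, audio: bytes, model: str, language: str = "auto") -> str:
        raise NotImplementedError


_REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator that adds a provider implementation to the registry."""
    def deco(cls):
        _REGISTRY[name] = cls
        return cls
    return deco


def get_provider(name: str) -> SpeechProvider:
    """Instantiate a registered provider, or fail loudly on an unknown name."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown provider: {name}")
    return _REGISTRY[name]()


@register("ollama")
class OllamaSpeech(SpeechProvider):
    def transcribe(self, audio: bytes, model: str, language: str = "auto") -> str:
        return "(stubbed transcription)"  # placeholder, no real inference
```

Each tool handler can then resolve the request's `provider` string through `get_provider` without knowing which backend it gets.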

Tools Provided

1. transcribe_audio

Speech-to-text transcription with language detection

from dataclasses import dataclass

@dataclass
class TranscriptionRequest:
    audio_data: str  # base64, file path, or URL
    model: str = "granite-speech-asr-en"
    provider: str = "ollama"
    language: str = "auto"  # auto-detect or specify
    enable_timestamps: bool = True
    speaker_diarization: bool = False
    noise_reduction: bool = True
    output_format: str = "text"  # text, srt, vtt, json
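The `audio_data` field accepts three shapes: a base64 data URI, a URL, or a file path. One way to normalize all three to raw bytes before decoding (a sketch; `load_audio_bytes` is a hypothetical helper, not part of the spec):

```python
import base64
import urllib.request
from pathlib import Path


def load_audio_bytes(audio_data: str) -> bytes:
    """Resolve audio_data (base64 data URI, URL, or file path) to raw bytes."""
    if audio_data.startswith("data:"):
        # "data:audio/wav;base64,<payload>" -- keep only the payload
        _, payload = audio_data.split(",", 1)
        return base64.b64decode(payload)
    if audio_data.startswith(("http://", "https://")):
        with urllib.request.urlopen(audio_data) as resp:  # remote file
            return resp.read()
    return Path(audio_data).read_bytes()  # local file path
```

A real server would add size limits and content-type checks on top of this before handing bytes to the model.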

2. translate_speech

Real-time speech translation between languages

from dataclasses import dataclass

@dataclass
class SpeechTranslationRequest:
    audio_data: str
    model: str = "granite-speech-translate-en-es"
    provider: str = "watsonx"
    source_language: str = "en"
    target_language: str = "es"
    preserve_speaker_style: bool = True
    include_confidence: bool = True
    streaming: bool = False

3. synthesize_speech

Text-to-speech synthesis with voice customization

from dataclasses import dataclass

@dataclass
class SpeechSynthesisRequest:
    text: str
    model: str = "granite-speech-tts-v1"
    provider: str = "huggingface"
    voice: str = "default"  # voice profile
    language: str = "en"
    speed: float = 1.0
    pitch: float = 1.0
    emotion: str = "neutral"  # neutral, happy, sad, excited
    output_format: str = "wav"  # wav, mp3, ogg

4. enhance_audio

Audio enhancement and noise reduction

from dataclasses import dataclass

@dataclass
class AudioEnhancementRequest:
    audio_data: str
    model: str = "granite-speech-enhancement"
    provider: str = "ollama"
    enhancement_type: str = "denoise"  # denoise, enhance, normalize, amplify
    preserve_speech: bool = True
    aggressive_filtering: bool = False
    output_quality: str = "high"  # low, medium, high
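Of the enhancement types above, `normalize` is the simplest to sketch: scale all samples so the loudest peak hits a target level. A pure-Python illustration (a real implementation would operate on NumPy arrays for speed):

```python
from typing import List


def peak_normalize(samples: List[float], peak: float = 0.95) -> List[float]:
    """Scale samples so that max(|sample|) == peak; silence passes through."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # all-silence input: nothing to scale
    gain = peak / loudest
    return [s * gain for s in samples]
```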

5. detect_language

Audio language identification and classification

from dataclasses import dataclass

@dataclass
class LanguageDetectionRequest:
    audio_data: str
    model: str = "granite-audio-classification"
    provider: str = "watsonx"
    confidence_threshold: float = 0.8
    return_all_probabilities: bool = False
    segment_analysis: bool = False  # Analyze segments for language switching

6. analyze_speaker

Speaker identification and voice characteristics

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpeakerAnalysisRequest:
    audio_data: str
    model: str = "granite-audio-classification"
    provider: str = "huggingface"
    analysis_type: str = "speaker_id"  # speaker_id, emotion, gender, age
    reference_speakers: Optional[List[str]] = None
    include_embeddings: bool = False
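With `include_embeddings` enabled, speaker identification against `reference_speakers` reduces to comparing embedding vectors. A cosine-similarity sketch (the embedding model itself is out of scope here, and `best_match` is a hypothetical helper):

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding
    return dot / (norm_a * norm_b)


def best_match(query: Sequence[float], references: Dict[str, Sequence[float]]) -> str:
    """Return the reference speaker whose embedding is closest to the query."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))
```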

7. real_time_transcription

Streaming audio transcription and processing

from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamingTranscriptionRequest:
    audio_stream: str  # streaming audio source
    model: str = "granite-speech-asr-en"
    provider: str = "ollama"
    chunk_duration: float = 2.0  # seconds
    overlap_duration: float = 0.5
    language: str = "auto"
    live_translation: Optional[str] = None  # target language
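The `chunk_duration`/`overlap_duration` pair implies a sliding window over the incoming samples, where consecutive windows share an overlap so words cut at a boundary still appear whole in one window. A sketch of that windowing (list-based for clarity; a real pipeline would slice NumPy buffers):

```python
from typing import Iterator, List


def sliding_windows(samples: List[float], sample_rate: int,
                    chunk_duration: float = 2.0,
                    overlap_duration: float = 0.5) -> Iterator[List[float]]:
    """Yield overlapping windows; neighbours share overlap_duration seconds."""
    chunk = int(chunk_duration * sample_rate)
    step = chunk - int(overlap_duration * sample_rate)
    for start in range(0, len(samples), step):
        yield samples[start:start + chunk]
        if start + chunk >= len(samples):
            break  # final (possibly short) window already covers the tail
```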

8. batch_audio_processing

Efficient batch processing of multiple audio files

from dataclasses import dataclass
from typing import List

@dataclass
class BatchAudioRequest:
    audio_files: List[str]
    model: str = "granite-speech-asr-en"
    provider: str = "watsonx"
    processing_type: str = "transcribe"  # transcribe, translate, enhance, analyze
    parallel_processing: bool = True
    max_concurrent: int = 4
    output_format: str = "json"
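`max_concurrent` maps naturally onto an `asyncio.Semaphore`: fan out all files at once, but cap how many are in flight. A sketch (the `worker` coroutine stands in for whichever `processing_type` is selected):

```python
import asyncio
from typing import Awaitable, Callable, List


async def process_batch(audio_files: List[str],
                        worker: Callable[[str], Awaitable[str]],
                        max_concurrent: int = 4) -> List[str]:
    """Run worker over every file, at most max_concurrent at a time.
    asyncio.gather preserves input order regardless of completion order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path: str) -> str:
        async with semaphore:
            return await worker(path)

    return await asyncio.gather(*(bounded(p) for p in audio_files))
```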

Implementation Requirements

Directory Structure

mcp-servers/python/granite_speech_server/
├── src/
│   └── granite_speech_server/
│       ├── __init__.py
│       ├── server.py
│       ├── providers/
│       │   ├── __init__.py
│       │   ├── ollama_speech.py
│       │   ├── watsonx_speech.py
│       │   ├── huggingface_speech.py
│       │   └── streaming_client.py
│       ├── models/
│       │   ├── __init__.py
│       │   ├── granite_speech_models.py
│       │   └── language_models.py
│       ├── processing/
│       │   ├── __init__.py
│       │   ├── audio_processor.py
│       │   ├── transcription.py
│       │   ├── translation.py
│       │   ├── synthesis.py
│       │   └── enhancement.py
│       ├── tools/
│       │   ├── __init__.py
│       │   ├── speech_recognition.py
│       │   ├── speech_translation.py
│       │   ├── text_to_speech.py
│       │   └── audio_analysis.py
│       └── utils/
│           ├── __init__.py
│           ├── audio_utils.py
│           ├── format_converters.py
│           └── streaming_utils.py
├── tests/
├── requirements.txt
├── README.md
└── examples/
    ├── transcription_example.py
    ├── translation_example.py
    └── real_time_processing.py

Dependencies

# requirements.txt
mcp>=1.0.0
transformers>=4.35.0
torch>=2.1.0
torchaudio>=2.1.0
librosa>=0.10.0
soundfile>=0.12.1
pyaudio>=0.2.11
speechrecognition>=3.10.0
pydub>=0.25.1
webrtcvad>=2.0.10
requests>=2.31.0
pydantic>=2.5.0
ollama>=0.1.7
ibm-watson>=7.0.0
numpy>=1.24.0
scipy>=1.11.0

Configuration

# config.yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    speech_models_enabled: true
    timeout: 300
    
  watsonx:
    url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com"
    apikey: "${WATSONX_SPEECH_API_KEY}"
    region: "us-south"
    
  huggingface:
    api_key: "${HF_API_KEY}"
    cache_dir: "./hf_speech_cache"
    device: "auto"

models:
  default_asr: "granite-speech-asr-en"
  default_translation: "granite-speech-translate-en-es"
  default_tts: "granite-speech-tts-v1"
  default_enhancement: "granite-speech-enhancement"

audio:
  sample_rate: 16000
  channels: 1
  max_file_size: "100MB"
  supported_formats: ["wav", "mp3", "ogg", "flac", "m4a"]
  temp_dir: "./temp_audio"
  chunk_size: 1024

processing:
  enable_gpu: true
  max_concurrent_requests: 6
  streaming_buffer_size: 4096
  silence_threshold: 0.01
  
languages:
  supported: ["en", "es", "fr", "de", "it", "pt", "ja", "zh"]
  auto_detection: true
  confidence_threshold: 0.7
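The `${WATSONX_SPEECH_API_KEY}`-style placeholders above need to be resolved from the environment before the YAML is parsed. A stdlib-only sketch of that pre-pass (real code might feed the result to PyYAML; unresolved names are left intact so missing keys stay visible rather than silently becoming empty strings):

```python
import os
import re

_ENV_VAR = re.compile(r"\$\{(\w+)\}")


def expand_env(text: str) -> str:
    """Replace ${NAME} with os.environ['NAME']; leave unknown names untouched."""
    return _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), text)
```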

Usage Examples

Audio Transcription

# Transcribe audio file to text
result = await mcp_client.call_tool("transcribe_audio", {
    "audio_data": "./meeting_recording.wav",
    "model": "granite-speech-asr-en",
    "provider": "ollama",
    "language": "auto",
    "enable_timestamps": True,
    "speaker_diarization": True,
    "output_format": "srt"
})

Speech Translation

# Translate speech from English to Spanish
result = await mcp_client.call_tool("translate_speech", {
    "audio_data": "data:audio/wav;base64,UklGRnoGAABXQVZFZm10IBAAAAABAAEA...",
    "model": "granite-speech-translate-en-es",
    "provider": "watsonx",
    "source_language": "en",
    "target_language": "es",
    "include_confidence": True
})

Text-to-Speech Synthesis

# Convert text to natural speech
result = await mcp_client.call_tool("synthesize_speech", {
    "text": "Welcome to our multilingual customer service. How may I help you today?",
    "model": "granite-speech-tts-v1",
    "provider": "huggingface",
    "voice": "professional_female",
    "language": "en",
    "emotion": "friendly",
    "output_format": "mp3"
})

Real-time Processing

# Real-time speech transcription
result = await mcp_client.call_tool("real_time_transcription", {
    "audio_stream": "rtmp://live.example.com/stream",
    "model": "granite-speech-asr-en",
    "provider": "ollama",
    "chunk_duration": 2.0,
    "language": "auto",
    "live_translation": "es"
})

Audio Enhancement

# Enhance audio quality
result = await mcp_client.call_tool("enhance_audio", {
    "audio_data": "./noisy_recording.wav",
    "model": "granite-speech-enhancement",
    "provider": "watsonx",
    "enhancement_type": "denoise",
    "preserve_speech": True,
    "output_quality": "high"
})

Batch Processing

# Process multiple audio files
result = await mcp_client.call_tool("batch_audio_processing", {
    "audio_files": [
        "./audio1.wav",
        "./audio2.mp3", 
        "./audio3.flac"
    ],
    "model": "granite-speech-asr-en",
    "provider": "ollama",
    "processing_type": "transcribe",
    "parallel_processing": True,
    "output_format": "json"
})

Advanced Features

  • Real-time Streaming: Live audio processing and transcription
  • Speaker Diarization: Multi-speaker identification and separation
  • Emotion Recognition: Detect emotional states in speech
  • Voice Cloning: Custom voice synthesis from samples
  • Audio Fingerprinting: Content identification and matching
  • Noise Cancellation: Advanced audio enhancement

Enterprise Features

  • High Accuracy Models: Enterprise-optimized speech recognition
  • Custom Vocabulary: Domain-specific terminology support
  • Compliance: SOX, GDPR compliance for audio data
  • Scalability: Handle high-volume audio processing
  • Integration: APIs for call center and meeting platforms
  • Security: Encrypted audio processing and storage

Performance Optimizations

  • GPU Acceleration: CUDA support for faster processing
  • Streaming Architecture: Minimize latency for real-time applications
  • Model Quantization: Optimized models for deployment
  • Caching: Intelligent caching of processed audio segments
  • Batch Processing: Efficient multi-file processing
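For the caching bullet, one workable scheme keys cache entries on a digest of the audio bytes plus the model name and request parameters, so byte-identical requests are served without re-running inference. A sketch (the key layout is an assumption, not a prescribed design):

```python
import hashlib
from typing import Any, Dict


def cache_key(audio_bytes: bytes, model: str, params: Dict[str, Any]) -> str:
    """Deterministic key: same audio + model + params -> same digest."""
    digest = hashlib.sha256()
    digest.update(audio_bytes)
    digest.update(model.encode("utf-8"))
    for name in sorted(params):  # sorted so dict insertion order never matters
        digest.update(f"{name}={params[name]!r}".encode("utf-8"))
    return digest.hexdigest()
```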

Acceptance Criteria

  • Python MCP server with 8+ Granite speech model tools
  • Support for all major Granite speech models
  • Multi-provider integration (Ollama, watsonx.ai, Hugging Face)
  • Speech recognition with multi-language support
  • Speech translation capabilities (English to 7+ languages)
  • Text-to-speech synthesis with voice customization
  • Real-time streaming audio processing
  • Audio enhancement and noise reduction
  • Batch processing for efficiency
  • Multi-format audio support
  • Comprehensive test suite with sample audio (>90% coverage)
  • Complete documentation with speech processing examples

Priority

High - Demonstrates IBM's advanced speech AI capabilities via MCP

Use Cases

  • Multilingual customer service and support
  • Real-time meeting transcription and translation
  • Voice-enabled applications and assistants
  • Audio content accessibility and processing
  • Call center analytics and quality monitoring
  • Language learning and pronunciation tools
  • Audio content localization and dubbing
  • Voice biometrics and speaker verification

Metadata

Assignees

No one assigned

Labels

  • enhancement (New feature or request)
  • mcp-servers (MCP Server Samples)
  • oic (Open Innovation Community Contribution)
  • python (Python / backend development (FastAPI))

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests
