
Add Prompt Authoring Tools Category to MCP Eval Server #896

@crivetimihai

Description

Overview

Add a new Prompt Authoring category to the MCP Eval Server with specialized tools for generating, optimizing, and validating prompts for AI assistants, agents, and other AI applications.

Enhancement Scope

Current State

  • 63 tools across 14 categories focused on evaluation
  • Strong evaluation capabilities but no prompt generation/authoring tools
  • Comprehensive assessment framework ready for extension

Proposed Addition: Prompt Authoring Category

Add 8 new tools in a 15th category specifically for prompt creation and optimization.

New Tool Category: 🎨 Prompt Authoring Tools (8 tools)

πŸ“ Prompt Generation Tools (4 tools)

1. prompt_authoring.generate_system_prompt

Purpose: Generate system prompts for AI assistants and agents
Features:

  • Role-based prompt templates (assistant, agent, specialist, analyst)
  • Personality and tone configuration
  • Capability definition and boundaries
  • Context window optimization
  • Multi-modal prompt support (text, vision, code)

Parameters:

{
    "role_type": "assistant|agent|specialist|analyst|custom",
    "domain": "general|technical|creative|business|academic|medical",
    "personality_traits": ["helpful", "analytical", "creative", "professional"],
    "capabilities": ["reasoning", "search", "code", "math", "vision"],
    "constraints": ["safety_focused", "factual_only", "cite_sources"],
    "interaction_style": "conversational|formal|technical|friendly",
    "context_length": "short|medium|long|comprehensive",
    "examples_count": 3,
    "include_guidelines": true,
    "custom_instructions": "Additional specific requirements..."
}

2. prompt_authoring.create_task_prompt

Purpose: Generate task-specific prompts for complex operations
Features:

  • Task decomposition and step-by-step instructions
  • Input/output format specification
  • Error handling and edge case coverage
  • Quality criteria definition
  • Chain-of-thought integration

Parameters:

{
    "task_type": "analysis|synthesis|creative|problem_solving|research",
    "complexity_level": "simple|intermediate|advanced|expert",
    "input_format": "text|json|csv|code|mixed",
    "output_format": "text|json|markdown|structured|custom",
    "include_cot": true,
    "step_by_step": true,
    "include_examples": true,
    "error_handling": "graceful|strict|verbose",
    "quality_criteria": ["accuracy", "completeness", "clarity"],
    "context_requirements": "Domain-specific context needed..."
}

3. prompt_authoring.generate_few_shot_examples

Purpose: Create optimized few-shot examples for prompt enhancement
Features:

  • Diverse example generation across input space
  • Quality scoring and selection
  • Edge case coverage
  • Format consistency validation
  • Difficulty progression

Parameters:

{
    "task_description": "Task for which to generate examples",
    "example_count": 5,
    "difficulty_levels": ["easy", "medium", "hard"],
    "diversity_focus": "input_variety|output_variety|edge_cases|balanced",
    "format_style": "conversational|structured|technical|creative",
    "include_explanations": true,
    "validate_consistency": true,
    "domain_context": "Specific domain requirements..."
}
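
A hypothetical client call, following the pattern of the Usage Examples section later in this issue (the mcp_client handle and the "examples" response key are assumptions, not a confirmed response schema):

# Generate a small, balanced example set for a classification task
result = await mcp_client.call_tool("prompt_authoring.generate_few_shot_examples", {
    "task_description": "Classify customer emails by urgency",
    "example_count": 5,
    "difficulty_levels": ["easy", "medium", "hard"],
    "diversity_focus": "balanced",
    "include_explanations": True,
    "validate_consistency": True
})

for example in result["examples"]:  # "examples" response key is an assumption
    print(example)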

4. prompt_authoring.create_agent_instructions

Purpose: Generate comprehensive instructions for AI agents with tools
Features:

  • Tool usage guidelines and best practices
  • Decision-making frameworks
  • Workflow orchestration instructions
  • Error recovery procedures
  • Performance optimization tips

Parameters:

{
    "agent_type": "research|coding|analysis|creative|support",
    "available_tools": ["search", "calculator", "code_executor", "file_manager"],
    "workflow_complexity": "linear|branching|iterative|adaptive",
    "decision_framework": "rule_based|heuristic|ml_guided|hybrid",
    "autonomy_level": "supervised|semi_autonomous|fully_autonomous",
    "error_recovery": "retry|escalate|alternative_path|user_query",
    "performance_focus": "speed|accuracy|cost_efficiency|user_satisfaction",
    "collaboration_mode": "solo|human_in_loop|multi_agent"
}
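
A sketch of a call for a supervised, tool-using research agent (same assumed mcp_client pattern as the Usage Examples below; the "instructions" response key is hypothetical):

# Generate instructions for a human-in-the-loop research agent
result = await mcp_client.call_tool("prompt_authoring.create_agent_instructions", {
    "agent_type": "research",
    "available_tools": ["search", "calculator", "file_manager"],
    "workflow_complexity": "iterative",
    "decision_framework": "heuristic",
    "autonomy_level": "supervised",
    "error_recovery": "retry",
    "collaboration_mode": "human_in_loop"
})

print(result["instructions"])  # response key is an assumption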

🔧 Prompt Optimization Tools (4 tools)

5. prompt_authoring.optimize_for_model

Purpose: Optimize prompts for specific language models
Features:

  • Model-specific formatting and structure
  • Token efficiency optimization
  • Context window utilization
  • Temperature and parameter recommendations
  • Bias mitigation strategies

Parameters:

{
    "target_model": "gpt-4|gpt-3.5|claude|llama|custom",
    "prompt_text": "Original prompt to optimize",
    "optimization_goals": ["token_efficiency", "clarity", "specificity", "bias_reduction"],
    "context_length_target": "short|optimal|maximum",
    "preserve_intent": true,
    "include_parameter_recommendations": true,
    "bias_mitigation": true,
    "performance_focus": "accuracy|speed|cost|balanced"
}

6. prompt_authoring.enhance_clarity_and_specificity

Purpose: Improve prompt clarity and reduce ambiguity
Features:

  • Ambiguity detection and resolution
  • Specificity enhancement suggestions
  • Language simplification options
  • Structure optimization
  • Readability improvement

Parameters:

{
    "prompt_text": "Prompt to enhance for clarity",
    "target_audience": "general|technical|domain_expert|novice",
    "clarity_focus": ["remove_ambiguity", "add_specificity", "improve_structure"],
    "simplification_level": "none|light|moderate|extensive",
    "preserve_technical_terms": true,
    "add_examples": true,
    "structure_improvements": true,
    "readability_target": "grade_8|grade_12|college|professional"
}
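
A minimal usage sketch (assumed client handle, as in the Usage Examples section; the "enhanced_prompt" response key is illustrative):

# Tighten an ambiguous prompt for a novice audience
result = await mcp_client.call_tool("prompt_authoring.enhance_clarity_and_specificity", {
    "prompt_text": "Make a report about the data",
    "target_audience": "novice",
    "clarity_focus": ["remove_ambiguity", "add_specificity"],
    "simplification_level": "moderate",
    "add_examples": True,
    "readability_target": "grade_12"
})

print(result["enhanced_prompt"])  # response key is an assumption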

7. prompt_authoring.generate_prompt_variants

Purpose: Create multiple variations of a prompt for A/B testing
Features:

  • Systematic variation generation
  • Style and tone alternatives
  • Structure modifications
  • Length variations
  • Approach diversification

Parameters:

{
    "base_prompt": "Original prompt to vary",
    "variation_count": 5,
    "variation_types": ["tone", "structure", "length", "examples", "approach"],
    "tone_variations": ["formal", "casual", "technical", "friendly"],
    "structure_variations": ["step_by_step", "bullet_points", "narrative", "qa_format"],
    "length_variations": ["concise", "detailed", "comprehensive"],
    "preserve_core_intent": true,
    "include_metadata": true,
    "optimization_focus": "performance|diversity|creativity"
}
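
For A/B testing, a hedged sketch of generating and iterating over variants (the "variants" response key and per-variant fields are assumptions):

# Produce three tonal/structural variants of a base prompt
result = await mcp_client.call_tool("prompt_authoring.generate_prompt_variants", {
    "base_prompt": "Summarize the following article in three sentences.",
    "variation_count": 3,
    "variation_types": ["tone", "structure"],
    "preserve_core_intent": True,
    "include_metadata": True
})

for variant in result["variants"]:  # response shape is an assumption
    print(variant["text"], variant.get("metadata"))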

8. prompt_authoring.validate_prompt_quality

Purpose: Comprehensive prompt quality assessment and improvement suggestions
Features:

  • Multi-dimensional quality scoring
  • Improvement recommendations
  • Bias detection
  • Completeness analysis
  • Best practices compliance

Parameters:

{
    "prompt_text": "Prompt to validate and assess",
    "quality_dimensions": ["clarity", "specificity", "completeness", "bias", "structure"],
    "target_use_case": "assistant|agent|creative|analytical|technical",
    "provide_improvements": true,
    "check_bias": true,
    "validate_examples": true,
    "best_practices_check": true,
    "generate_score": true,
    "detailed_feedback": true
}
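
A sketch of a validation call gated on an overall score (the score field name and 0-1 scale are assumptions):

# Score a prompt and surface improvement suggestions
result = await mcp_client.call_tool("prompt_authoring.validate_prompt_quality", {
    "prompt_text": "You are a helpful assistant. Answer questions.",
    "quality_dimensions": ["clarity", "specificity", "completeness"],
    "target_use_case": "assistant",
    "provide_improvements": True,
    "generate_score": True
})

if result["overall_score"] < 0.8:  # field name and scale are assumptions
    print(result["improvements"])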

Implementation Requirements

File Structure

mcp_eval_server/
├── src/
│   └── mcp_eval_server/
│       ├── tools/
│       │   └── prompt_authoring/          # New category
│       │       ├── __init__.py
│       │       ├── generation.py          # Generation tools (1-4)
│       │       ├── optimization.py        # Optimization tools (5-8)
│       │       ├── templates/             # Prompt templates
│       │       │   ├── system_prompts.yaml
│       │       │   ├── task_templates.yaml
│       │       │   ├── agent_instructions.yaml
│       │       │   └── examples.yaml
│       │       └── utils/                 # Utilities
│       │           ├── template_engine.py
│       │           ├── quality_scorer.py
│       │           └── model_optimizer.py
│       └── server.py                      # Updated to include new tools
└── README.md                              # Updated with new category

Integration Points

  1. Server Registration: Add prompt authoring tools to the server tool registry (a registration sketch follows this list)
  2. Documentation: Update README.md with new category and tools
  3. Templates: Create comprehensive prompt template library
  4. Quality Framework: Integrate with existing quality assessment tools
  5. Model Integration: Leverage existing judge models for optimization
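
What server registration might look like, as a minimal sketch; the registry object and its register method are hypothetical placeholders for the server's actual tool-registration API, while the module and tool names follow the file structure above:

# server.py (sketch) -- registry API is a hypothetical placeholder
from mcp_eval_server.tools.prompt_authoring import generation, optimization

def register_prompt_authoring_tools(registry):
    """Attach the 8 new prompt authoring tools to the server's tool registry."""
    registry.register("prompt_authoring.generate_system_prompt", generation.generate_system_prompt)
    registry.register("prompt_authoring.create_task_prompt", generation.create_task_prompt)
    registry.register("prompt_authoring.generate_few_shot_examples", generation.generate_few_shot_examples)
    registry.register("prompt_authoring.create_agent_instructions", generation.create_agent_instructions)
    registry.register("prompt_authoring.optimize_for_model", optimization.optimize_for_model)
    registry.register("prompt_authoring.enhance_clarity_and_specificity", optimization.enhance_clarity_and_specificity)
    registry.register("prompt_authoring.generate_prompt_variants", optimization.generate_prompt_variants)
    registry.register("prompt_authoring.validate_prompt_quality", optimization.validate_prompt_quality)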

Template Library Structure

# system_prompts.yaml
assistant_templates:
  general_assistant:
    base_prompt: "You are a helpful, harmless, and honest AI assistant..."
    personality_options: ["professional", "friendly", "analytical"]
    capability_modules: ["reasoning", "search", "math"]
  
  technical_specialist:
    base_prompt: "You are a technical specialist with deep expertise..."
    domains: ["software_engineering", "data_science", "devops"]
    
agent_templates:
  research_agent:
    workflow_prompt: "As a research agent, follow this systematic process..."
    tool_usage_guidelines: "Use tools in the following priority order..."
    
  coding_agent:
    base_instructions: "You are an expert coding agent that..."
    best_practices: ["test_driven", "documentation", "security"]
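
Since Jinja2 is already listed under Dependencies, template_engine.py could be a thin wrapper over it. A minimal sketch, assuming the YAML layout above (the function name and lookup path are illustrative, not a settled interface):

# utils/template_engine.py (sketch)
import yaml
from jinja2 import Template

def render_template(yaml_path: str, group: str, name: str, **context) -> str:
    """Load one entry from the YAML template library and render its base prompt."""
    with open(yaml_path, encoding="utf-8") as f:
        library = yaml.safe_load(f)
    entry = library[group][name]  # e.g. assistant_templates -> general_assistant
    return Template(entry["base_prompt"]).render(**context)

# Hypothetical usage:
# render_template("templates/system_prompts.yaml",
#                 "assistant_templates", "general_assistant", domain="technical")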

Quality Assessment Integration

  • Cross-reference with existing tools: Use current evaluation tools to assess generated prompts
  • Feedback loop: Generated prompts can be evaluated using existing quality tools
  • Optimization cycle: Iterative improvement using evaluation feedback (sketched after this list)
  • Validation pipeline: Automated testing of generated prompts
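
One possible shape for that optimization cycle, using only tools defined in this issue (the loop bound, score threshold, and the "overall_score"/"enhanced_prompt" response keys are assumptions; "generated_prompt" matches Example 1 below):

# Iterative generate -> validate -> enhance loop (sketch)
async def author_with_feedback(mcp_client, spec: dict, min_score: float = 0.8) -> str:
    prompt = (await mcp_client.call_tool(
        "prompt_authoring.generate_system_prompt", spec))["generated_prompt"]
    for _ in range(3):  # bounded number of improvement rounds
        report = await mcp_client.call_tool("prompt_authoring.validate_prompt_quality", {
            "prompt_text": prompt, "generate_score": True, "provide_improvements": True})
        if report["overall_score"] >= min_score:  # score field/scale are assumptions
            break
        prompt = (await mcp_client.call_tool(
            "prompt_authoring.enhance_clarity_and_specificity",
            {"prompt_text": prompt}))["enhanced_prompt"]
    return prompt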

Usage Examples

Example 1: Generate System Prompt for Technical Assistant

# MCP Client usage
result = await mcp_client.call_tool("prompt_authoring.generate_system_prompt", {
    "role_type": "specialist",
    "domain": "technical", 
    "personality_traits": ["analytical", "precise", "helpful"],
    "capabilities": ["reasoning", "code", "math"],
    "constraints": ["cite_sources", "factual_only"],
    "interaction_style": "technical",
    "include_guidelines": True,
    "examples_count": 3
})

# Result includes optimized system prompt ready for use
print(result["generated_prompt"])

Example 2: Create Task-Specific Prompt

# Generate prompt for data analysis task
result = await mcp_client.call_tool("prompt_authoring.create_task_prompt", {
    "task_type": "analysis",
    "complexity_level": "advanced", 
    "input_format": "csv",
    "output_format": "structured",
    "include_cot": True,
    "step_by_step": True,
    "quality_criteria": ["accuracy", "completeness", "insights"]
})

Example 3: Optimize Prompt for Specific Model

# Optimize existing prompt for GPT-4
result = await mcp_client.call_tool("prompt_authoring.optimize_for_model", {
    "target_model": "gpt-4",
    "prompt_text": "Write a detailed analysis of...",
    "optimization_goals": ["token_efficiency", "clarity", "specificity"],
    "context_length_target": "optimal",
    "include_parameter_recommendations": True
})

Benefits

For Developers

  • Rapid Prompt Development: Generate high-quality prompts quickly
  • Best Practices: Built-in prompt engineering best practices
  • Model Optimization: Automatically optimize for different models
  • Quality Assurance: Validate and improve prompt quality

For Researchers

  • Systematic Prompt Creation: Structured approach to prompt development
  • A/B Testing: Generate variations for comparative analysis
  • Bias Mitigation: Built-in bias detection and mitigation
  • Reproducibility: Consistent prompt generation with metadata

For Enterprise

  • Standardization: Consistent prompt quality across applications
  • Efficiency: Reduce time spent on manual prompt crafting
  • Integration: Works with existing MCP infrastructure
  • Scalability: Generate prompts at scale for multiple use cases

Testing Strategy

  • Unit Tests: Individual tool functionality (see the pytest sketch after this list)
  • Integration Tests: Cross-tool compatibility
  • Quality Tests: Generated prompt effectiveness
  • Performance Tests: Generation speed and resource usage
  • Validation Tests: Template quality and completeness
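
For the unit-test layer, a pytest sketch. It assumes pytest-asyncio for the async test and an in-process callable whose import path mirrors the proposed file structure; the function signature and response key are hypothetical:

# tests/test_prompt_authoring.py (sketch)
import pytest
from mcp_eval_server.tools.prompt_authoring import generation

@pytest.mark.asyncio
async def test_generate_system_prompt_returns_prompt():
    # signature and "generated_prompt" key are assumptions about the final API
    result = await generation.generate_system_prompt(
        role_type="assistant", domain="general")
    assert "generated_prompt" in result
    assert len(result["generated_prompt"]) > 0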

Documentation Updates

README.md Updates

  1. Tool count: Update from "63 tools" to "71 tools across 15 categories"
  2. New category section: Add Prompt Authoring Tools description
  3. Quick start examples: Include prompt authoring examples
  4. Tool reference table: Add all 8 new tools

Example Updated Tool Overview

πŸ† MCP EVALUATION SERVER - 71 SPECIALIZED TOOLS πŸ†
═══════════════════════════════════════════════════════════

πŸ“Š CORE EVALUATION SUITE (15 tools)
β”œβ”€β”€ πŸ€– Judge Tools (4) ────── LLM-as-a-judge evaluation  
β”œβ”€β”€ πŸ“ Prompt Tools (4) ───── Clarity, consistency, optimization
β”œβ”€β”€ πŸ› οΈ Agent Tools (4) ────── Performance, reasoning, benchmarking
└── πŸ” Quality Tools (3) ──── Factuality, coherence, toxicity

πŸ”¬ ADVANCED ASSESSMENT SUITE (39 tools)
β”œβ”€β”€ πŸ”— RAG Tools (8) ──────── Retrieval relevance, grounding, citations
β”œβ”€β”€ βš–οΈ Bias & Fairness (6) ── Demographic bias, intersectional analysis  
β”œβ”€β”€ πŸ›‘οΈ Robustness (5) ──────── Adversarial testing, injection resistance
β”œβ”€β”€ πŸ”’ Safety & Alignment (4) Harmful content, value alignment
β”œβ”€β”€ 🌍 Multilingual (4) ────── Translation, cultural adaptation
β”œβ”€β”€ ⚑ Performance (4) ──────── Latency, efficiency, scaling
└── πŸ” Privacy (8) ───────── PII detection, compliance, anonymization

🎨 PROMPT AUTHORING SUITE (8 tools)  # NEW!
β”œβ”€β”€ πŸ“ Generation Tools (4) ── System prompts, task prompts, examples, agent instructions
└── πŸ”§ Optimization Tools (4) Model optimization, clarity enhancement, variants, validation

πŸ”§ SYSTEM MANAGEMENT (9 tools)
β”œβ”€β”€ πŸ”„ Workflow Tools (3) ─── Evaluation suites, parallel execution
β”œβ”€β”€ πŸ“Š Calibration (2) ────── Judge agreement, rubric optimization  
└── πŸ₯ Server Tools (4) ───── Health monitoring, system management

🎯 TOTAL: 71 TOOLS ACROSS 15 CATEGORIES 🎯  # UPDATED!

Acceptance Criteria

  • Add 8 new prompt authoring tools as specified
  • Create comprehensive template library with examples
  • Integrate with existing quality assessment framework
  • Update README.md with new category and tool count
  • Add comprehensive documentation for each tool
  • Include usage examples in multiple formats (MCP, REST, HTTP)
  • Implement proper error handling and validation
  • Add unit tests for all new tools (>90% coverage)
  • Performance testing for prompt generation speed
  • Quality validation of generated prompts using existing tools

Priority

High - Complements existing evaluation capabilities with content creation

Dependencies

  • Existing MCP Eval Server infrastructure
  • Template engine (Jinja2 or similar)
  • Integration with current tool registration system
  • Access to existing quality assessment tools for validation

Future Enhancements

  • AI-Powered Optimization: Use ML models for advanced prompt optimization
  • Community Templates: User-contributed prompt template sharing
  • Integration Plugins: Direct integration with popular AI platforms
  • Visual Prompt Builder: GUI for prompt construction and testing
  • Collaborative Editing: Multi-user prompt development workflows
