
Commit 570da12

[FEATURE] add json schema support (#115)
* Add --json-response flag for structured API responses

  Adds a new CLI flag that enables JSON response formatting:
  - Adds json_response field to the RequestFuncInput model
  - Modifies the OpenAI backend to apply JSON formatting when the flag is enabled
  - Includes response_format and chat_template_kwargs settings
  - Prompts the model to avoid premature JSON closure

* change prompt

* add --disable-thinking separately

* slight prompt change

* update README

* Implement JSON schema support for structured outputs

  - Add --json-schema-file and --json-schema-inline CLI arguments
  - Add --json-response-prompt for customizable JSON formatting messages
  - Extend the RequestFuncInput and Client classes with json_schema support
  - Update the OpenAI chat completions backend to use the proper JSON schema format
  - Add sample JSON schema files for testing
  - Maintain backward compatibility with the existing --json-response flag

* Enhance the JSON schema system with flexible prompt handling

  - Replace --json-response-prompt with a unified --json-prompt argument
  - Add @file syntax support for loading prompts from files
  - Add an --include-schema-in-prompt flag to include the schema in the prompt text
  - Implement comprehensive input validation with clear error messages
  - Simplify backend prompt logic with consistent schema formatting
  - Add extensive README documentation with examples and usage patterns
  - Remove the deprecated --json-response-prompt for a cleaner API
  - Fix error handling for malformed JSON responses in streaming mode

* Fix overly general exception handling in main.py

  - Replace broad Exception catches with specific exception types
  - Use OSError and PermissionError for file operations
  - Use json.JSONDecodeError for JSON parsing errors
  - Improve error messages with more specific context

* Clean up sample schemas, keeping only simple_schema.json

  - Remove complex_schema.json and sample_response_schema.json
  - Keep simple_schema.json as the primary example schema
  - Update simple_schema.json with an improved structure

* Simplify the JSON schema documentation in the README

  - Remove verbose examples and compatibility notes
  - Keep only the essential file-based and inline schema usage
  - Reference tests/data/simple_schema.json for an example schema
  - Make the documentation concise and focused

* Refactor JSON validation into the parse_args function

  - Move JSON argument validation from run_main() to parse_args()
  - Create a validate_json_args() function for better separation of concerns
  - Process and validate JSON arguments early, during argument parsing
  - Store the processed custom_prompt and json_schema in the args namespace
  - Keep the same validation logic, now in its proper location
  - Follow the pattern of the other argument validations in parse_args()

* Consolidate the JSON schema arguments into a unified --json-schema flag

  Replace the separate --json-schema-file and --json-schema-inline arguments with a single --json-schema that supports both inline JSON and @file syntax, matching the pattern established by --json-prompt.

* Clean up code to address review comments

* Update README.md
  Co-authored-by: Benjamin Chislett <[email protected]>

* Update README.md
  Co-authored-by: Benjamin Chislett <[email protected]>

---------

Co-authored-by: Benjamin Chislett <[email protected]>
1 parent 74816b1 commit 570da12

File tree

5 files changed: +268 −2 lines

README.md

Lines changed: 19 additions & 1 deletion
@@ -59,6 +59,11 @@ After benchmarking, the results are saved to `output-file.json` (or specified by
 | `--disable-tqdm` | Specify to disable tqdm progress bar. |
 | `--best-of` | Number of best completions to return. |
 | `--use-beam-search` | Use beam search for completions. |
+| `--json-response` | Request responses in JSON object format from the API. |
+| `--json-prompt` | Custom instructions appended to the end of the original prompt when using one of the JSON modes (by default, no additional context is added). Supports inline text or file input with `@file` syntax (e.g., `--json-prompt @prompt.txt`). |
+| `--json-schema` | JSON schema for structured output validation. Supports an inline JSON string or file input with `@file` syntax (e.g., `--json-schema @schema.json`). |
+| `--include-schema-in-prompt` | Include the JSON schema in the prompt text for better LLM comprehension. Requires `--json-schema` to be specified. |
+| `--disable-thinking` | Disable thinking mode in chat templates. |
 | `--output-file` | Output json file to save the results. |
 | `--debug` | Log debug messages. |
 | `--profile` | Use Torch Profiler. The endpoint must be launched with VLLM_TORCH_PROFILER_DIR to enable profiler. |
@@ -72,6 +77,18 @@ After benchmarking, the results are saved to `output-file.json` (or specified by
 | `--top-p` | Top-P to use for sampling. Defaults to None, or 1.0 for backends which require it to be specified. |
 | `--top-k` | Top-K to use for sampling. Defaults to None. |
 
+### JSON Schema Support
+
+For structured JSON outputs with schema validation:
+
+```bash
+# File-based schema (see tests/data/simple_schema.json for an example)
+fib benchmark --json-schema @tests/data/simple_schema.json -n 20 -rps 10 --backend openai-chat --endpoint /v1/chat/completions
+
+# Inline schema
+fib benchmark --json-schema '{"type":"object","properties":{"answer":{"type":"string"}},"required":["answer"]}' -n 20 -rps 10 --backend openai-chat --endpoint /v1/chat/completions
+```
+
 In addition to providing these arguments on the command-line, you can use `--config-file` to pre-define the parameters for your use case. Examples are provided in `examples/`
 
 ### Output
@@ -180,4 +197,5 @@ Mean ITL (ms): 9.35
 Median ITL (ms): 8.00
 P99 ITL (ms): 89.88
 ==================================================
-```
+```
+

src/flexible_inference_benchmark/engine/backend_functions.py

Lines changed: 43 additions & 1 deletion
@@ -43,6 +43,11 @@ class RequestFuncInput(BaseModel):
     top_p: Optional[float] = None
     top_k: Optional[int] = None
     run_id: Optional[str] = None
+    json_response: bool = False
+    custom_prompt: str = ""
+    disable_thinking: bool = False
+    json_schema: Optional[Dict[str, Any]] = None
+    include_schema_in_prompt: bool = False
 
 
 class RequestFuncOutput(BaseModel):
@@ -448,13 +453,50 @@ async def async_request_openai_chat_completions(
     with otel_span as span:
         async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
             assert not request_func_input.use_beam_search
+
+            # Apply custom prompt and schema formatting
+            append_msg = ""
+
+            # 1. Append custom prompt when provided
+            if request_func_input.custom_prompt:
+                append_msg += request_func_input.custom_prompt
+
+            # 2. Include schema in prompt if requested
+            if request_func_input.include_schema_in_prompt and request_func_input.json_schema:
+                if append_msg:
+                    append_msg += "\n\n"
+                append_msg += "Please follow this JSON schema for your response:\n```json\n"
+                append_msg += json.dumps(request_func_input.json_schema, indent=2)
+                append_msg += "\n```"
+
+            # Apply the combined message to content
+            if append_msg:
+                if isinstance(content_body, str):
+                    content_body += append_msg
+                else:
+                    content_body[-1]["text"] += append_msg
+
             payload = {
                 "model": request_func_input.model,
                 "messages": [{"role": "user", "content": content_body}],
                 "max_tokens": request_func_input.output_len,
                 "stream": request_func_input.stream,
                 "ignore_eos": request_func_input.ignore_eos,
            }
+
+            # Select the response format: a schema takes precedence over the
+            # plain --json-response JSON-object mode
+            if request_func_input.json_schema:
+                payload["response_format"] = {
+                    "type": "json_schema",
+                    "json_schema": {"name": "response", "schema": request_func_input.json_schema, "strict": True},
+                }
+            elif request_func_input.json_response:
+                payload["response_format"] = {"type": "json_object"}
+
+            # Add thinking control if flag is enabled
+            if request_func_input.disable_thinking:
+                payload["chat_template_kwargs"] = {"enable_thinking": False}
+
             if request_func_input.stream:
                 payload["stream_options"] = {"include_usage": True}
             apply_sampling_params(payload, request_func_input, always_top_p=False)
@@ -505,7 +547,7 @@ async def async_request_openai_chat_completions(
                     delta = None
                     content = None
                     reasoning_content = None
-                    if request_func_input.stream and len(data["choices"]) > 0:
+                    if request_func_input.stream and "choices" in data and len(data["choices"]) > 0:
                         delta = data["choices"][0]["delta"]
                         content = delta.get("content", None)
                         reasoning_content = delta.get("reasoning_content", None)
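
For reference, a minimal sketch (not part of the commit) of the request body the backend above produces when a schema is supplied. The `response_format` structure mirrors the diff; the model name, prompt, and token limit are placeholders:

```python
import json

# Hypothetical schema; any JSON-object schema works here.
json_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

payload = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "Reply in JSON."}],
    "max_tokens": 256,
    "stream": False,
    "ignore_eos": True,
    # --json-schema path: constrained decoding against the supplied schema
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "response", "schema": json_schema, "strict": True},
    },
}

# --json-response path (mutually exclusive with --json-schema):
# payload["response_format"] = {"type": "json_object"}

print(json.dumps(payload, indent=2))
```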

src/flexible_inference_benchmark/engine/client.py

Lines changed: 39 additions & 0 deletions
@@ -44,6 +44,11 @@ def __init__(
         top_p: Optional[float] = None,
         top_k: Optional[int] = None,
         run_id: Optional[str] = None,
+        json_response: bool = False,
+        custom_prompt: str = "",
+        disable_thinking: bool = False,
+        json_schema: Optional[Dict[str, Any]] = None,
+        include_schema_in_prompt: bool = False,
     ):
         self.backend = backend
         self.api_url = api_url
@@ -66,6 +71,11 @@ def __init__(
         self.top_p = top_p
         self.top_k = top_k
         self.run_id = run_id or str(uuid.uuid4())
+        self.json_response = json_response
+        self.custom_prompt = custom_prompt
+        self.disable_thinking = disable_thinking
+        self.json_schema = json_schema
+        self.include_schema_in_prompt = include_schema_in_prompt
 
     @property
     def request_func(
@@ -178,6 +188,11 @@ async def benchmark(
                 top_p=self.top_p,
                 top_k=self.top_k,
                 run_id=self.run_id,
+                json_response=self.json_response,
+                custom_prompt=self.custom_prompt,
+                disable_thinking=self.disable_thinking,
+                json_schema=self.json_schema,
+                include_schema_in_prompt=self.include_schema_in_prompt,
             )
             for (data_sample, media_sample) in zip(data, requests_media)
         ]
@@ -221,6 +236,12 @@ async def validate_url_endpoint(
             temperature=self.temperature,
             top_p=self.top_p,
             top_k=self.top_k,
+            run_id=self.run_id,
+            json_response=self.json_response,
+            custom_prompt=self.custom_prompt,
+            disable_thinking=self.disable_thinking,
+            json_schema=self.json_schema,
+            include_schema_in_prompt=self.include_schema_in_prompt,
         )
         return await self.send_request(-1, data, 0, None, None)
 
@@ -239,6 +260,15 @@ async def start_torch_profiler(self) -> Union[RequestFuncOutput, Any]:
             stream=self.stream,
             cookies=self.cookies,
             logprobs=self.logprobs,
+            temperature=self.temperature,
+            top_p=self.top_p,
+            top_k=self.top_k,
+            run_id=self.run_id,
+            json_response=self.json_response,
+            custom_prompt=self.custom_prompt,
+            disable_thinking=self.disable_thinking,
+            json_schema=self.json_schema,
+            include_schema_in_prompt=self.include_schema_in_prompt,
         )
         return await self.signal_profiler(0, data, 0, None, None)
 
@@ -257,5 +287,14 @@ async def stop_torch_profiler(self) -> Union[RequestFuncOutput, Any]:
             stream=self.stream,
             cookies=self.cookies,
             logprobs=self.logprobs,
+            temperature=self.temperature,
+            top_p=self.top_p,
+            top_k=self.top_k,
+            run_id=self.run_id,
+            json_response=self.json_response,
+            custom_prompt=self.custom_prompt,
+            disable_thinking=self.disable_thinking,
+            json_schema=self.json_schema,
+            include_schema_in_prompt=self.include_schema_in_prompt,
        )
        return await self.signal_profiler(0, data, 0, None, None)

src/flexible_inference_benchmark/main.py

Lines changed: 151 additions & 0 deletions
@@ -470,6 +470,37 @@ def add_benchmark_subparser(subparsers: argparse._SubParsersAction) -> Any: # t
 
     benchmark_parser.add_argument("--use-beam-search", action="store_true", help="Use beam search for completions.")
 
+    benchmark_parser.add_argument(
+        "--json-response", action="store_true", help="Request responses in JSON format from the API."
+    )
+
+    benchmark_parser.add_argument(
+        "--json-prompt",
+        type=str,
+        default="",
+        help="Custom prompt message to append when using JSON modes. "
+        "Supports inline text or file input with @file syntax (e.g., --json-prompt @prompt.txt). "
+        "Always appended when specified, regardless of JSON mode type.",
+    )
+
+    benchmark_parser.add_argument(
+        "--json-schema",
+        type=str,
+        help="JSON schema for structured output validation. "
+        "Supports inline JSON string or file input with @file syntax (e.g., --json-schema @schema.json).",
+    )
+
+    benchmark_parser.add_argument(
+        "--include-schema-in-prompt",
+        action="store_true",
+        help="Include the JSON schema in the prompt text for better LLM comprehension. "
+        "Requires --json-schema to be specified.",
+    )
+
+    benchmark_parser.add_argument(
+        "--disable-thinking", action="store_true", help="Disable thinking mode in chat templates."
+    )
+
     benchmark_parser.add_argument(
         "--output-file",
         type=str,
@@ -515,6 +546,114 @@ def add_benchmark_subparser(subparsers: argparse._SubParsersAction) -> Any: # t
     return benchmark_parser
 
 
+def validate_json_args(args: argparse.Namespace) -> None:
+    """Validate JSON-related arguments and load files."""
+    if args.subcommand != 'benchmark':
+        return
+
+    # Process JSON prompt with @file support
+    custom_prompt = ""
+    if args.json_prompt:
+        if args.json_prompt.startswith("@"):
+            # File-based prompt loading
+            prompt_file_path = args.json_prompt[1:]  # Remove @ prefix
+            try:
+                with open(prompt_file_path, 'r', encoding='utf-8') as f:
+                    custom_prompt = f.read().strip()
+                if not custom_prompt:
+                    logger.error(f"Prompt file '{prompt_file_path}' is empty")
+                    sys.exit(1)
+                logger.info(f"Loaded custom prompt from {prompt_file_path}")
+            except FileNotFoundError:
+                logger.error(f"Prompt file '{prompt_file_path}' does not exist")
+                sys.exit(1)
+            except UnicodeDecodeError as e:
+                logger.error(f"Cannot read prompt file '{prompt_file_path}': {e}")
+                sys.exit(1)
+            except (OSError, PermissionError) as e:
+                logger.error(f"Failed to load prompt file '{prompt_file_path}': {e}")
+                sys.exit(1)
+        else:
+            # Inline prompt
+            custom_prompt = args.json_prompt
+
+    # Store processed prompt back to args
+    args.json_prompt = custom_prompt
+
+    # Process JSON schema if provided
+    json_schema = None
+    original_json_schema = getattr(args, 'json_schema', None)
+    if args.json_schema:
+        if args.json_schema.startswith("@"):
+            # File-based schema loading
+            schema_file_path = args.json_schema[1:]  # Remove @ prefix
+            try:
+                with open(schema_file_path, 'r') as f:
+                    json_schema = json.load(f)
+                # Basic validation that it's a valid JSON schema structure
+                if not isinstance(json_schema, dict):
+                    logger.error("JSON schema must be a JSON object")
+                    sys.exit(1)
+                logger.info(f"Loaded JSON schema from {schema_file_path}")
+            except FileNotFoundError:
+                logger.error(f"JSON schema file '{schema_file_path}' does not exist")
+                sys.exit(1)
+            except (OSError, PermissionError) as e:
+                logger.error(f"Failed to load JSON schema file '{schema_file_path}': {e}")
+                sys.exit(1)
+            except json.JSONDecodeError as e:
+                logger.error(f"Invalid JSON in schema file '{schema_file_path}': {e}")
+                sys.exit(1)
+        else:
+            # Inline schema
+            try:
+                json_schema = json.loads(args.json_schema)
+                # Basic validation that it's a valid JSON schema structure
+                if not isinstance(json_schema, dict):
+                    logger.error("JSON schema must be a JSON object")
+                    sys.exit(1)
+                logger.info("Loaded inline JSON schema")
+            except json.JSONDecodeError as e:
+                logger.error(f"Invalid JSON in inline schema: {e}")
+                sys.exit(1)
+
+    # Store processed schema back to args
+    args.json_schema = json_schema
+
+    # Comprehensive input validation
+    # 1. Check for contradictory flag combinations
+    if json_schema and args.json_response:
+        logger.error("Cannot use both --json-response and --json-schema together")
+        logger.error("Suggestion: Choose either --json-response or --json-schema")
+        sys.exit(2)
+
+    # 2. Check for schema-dependent flags without schema
+    if args.include_schema_in_prompt:
+        if not json_schema:
+            logger.error("--include-schema-in-prompt requires a JSON schema")
+            logger.error("Suggestion: Add --json-schema <schema> or --json-schema @file")
+            sys.exit(3)
+
+    # 3. File size warnings (optional)
+    if original_json_schema and original_json_schema.startswith("@"):
+        schema_file_path = original_json_schema[1:]
+        try:
+            file_size = os.path.getsize(schema_file_path)
+            if file_size > 1024 * 1024:  # 1MB
+                logger.warning(f"Large schema file ({file_size / (1024*1024):.1f}MB) may impact performance")
+        except OSError:
+            pass  # File size check is optional
+
+    if args.json_prompt and args.json_prompt.startswith("@"):
+        prompt_file_path = args.json_prompt[1:]
+        try:
+            file_size = os.path.getsize(prompt_file_path)
+            if file_size > 100 * 1024:  # 100KB
+                logger.warning(f"Large prompt file ({file_size / 1024:.1f}KB) may impact performance")
+        except OSError:
+            pass  # File size check is optional
+
+
 def parse_args() -> argparse.Namespace:
 
     parser = argparse.ArgumentParser(description="CentML Inference Benchmark")
@@ -630,6 +769,9 @@ def fail(msg: str) -> None:
     if args.num_trials > MAX_TRIALS:
         logger.warning(f"High num_trials value ({args.num_trials}) may slow down prompt generation")
 
+    # Validate JSON-related arguments
+    validate_json_args(args)
+
     return args
 
 
@@ -716,6 +858,10 @@ def run_main(args: argparse.Namespace) -> None:
     endpoint = args.endpoint.strip("/")
     args.api_url = f"{base_url}/{endpoint}"
 
+    # JSON processing and validation handled in parse_args()
+    custom_prompt = args.json_prompt
+    json_schema = getattr(args, 'json_schema', None)
+
     client = Client(
         args.backend,
         args.api_url,
@@ -736,6 +882,11 @@ def run_main(args: argparse.Namespace) -> None:
         args.top_p,
         args.top_k,
         run_id=run_id,
+        json_response=args.json_response,
+        custom_prompt=custom_prompt,
+        disable_thinking=args.disable_thinking,
+        json_schema=json_schema,
+        include_schema_in_prompt=getattr(args, 'include_schema_in_prompt', False),
     )
     # disable verbose output for validation of the endpoint. This is done to avoid confusion on terminal output.
     client_verbose_value = client.verbose
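
The @file convention implemented by `validate_json_args` condenses to the following standalone sketch (a hypothetical helper, with the logger-and-exit error handling reduced to plain exceptions):

```python
import json
from pathlib import Path
from typing import Any, Dict

def resolve_json_schema(value: str) -> Dict[str, Any]:
    """Resolve a --json-schema value: inline JSON, or '@path' to load from a file."""
    text = Path(value[1:]).read_text(encoding="utf-8") if value.startswith("@") else value
    schema = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if not isinstance(schema, dict):
        raise ValueError("JSON schema must be a JSON object")
    return schema

print(resolve_json_schema('{"type": "object"}'))        # inline form
# resolve_json_schema("@tests/data/simple_schema.json")  # file form
```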

tests/data/simple_schema.json

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+{
+  "type": "object",
+  "properties": {
+    "answer": {
+      "type": "string",
+      "minLength": 2000,
+      "description": "The answer to the question"
+    },
+    "reasoning": {
+      "type": "string",
+      "description": "Explanation of the reasoning"
+    }
+  },
+  "required": ["answer"],
+  "additionalProperties": false
+}
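
Note the `minLength: 2000` on `answer`, which presumably forces long completions under the schema constraint. A quick conformance check is possible with the third-party `jsonschema` package (an assumption of this snippet, not a dependency the commit adds):

```python
from jsonschema import validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "minLength": 2000},
        "reasoning": {"type": "string"},
    },
    "required": ["answer"],
    "additionalProperties": False,
}

instance = {"answer": "x" * 2000}  # "answer" must be at least 2000 characters
validate(instance=instance, schema=schema)  # raises ValidationError on mismatch
print("instance conforms to the schema")
```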
