| FastAPI Gateway | Unified access point, simplified client calls, concurrency control |
| Triton Server | Layout detection model (PP-DocLayoutV3) and pipeline orchestration; model management, dynamic batching, inference scheduling |
| vLLM Server | VLM (PaddleOCR-VL-1.5), continuous batching inference |
**Triton Models:**
Triton automatically batches requests to improve inference device utilization. The maximum batch size is controlled by the `max_batch_size` parameter in the model configuration file (default: 8), located at `config.pbtxt` under each model directory in the model repository (e.g., `model_repo/layout-parsing/config.pbtxt`).
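As an illustration, the batching-related portion of a `config.pbtxt` might look like the following. This is a sketch, not the shipped configuration: `max_batch_size: 8` matches the stated default, but the `dynamic_batching` settings (such as the queue delay) are example values.

```
# model_repo/layout-parsing/config.pbtxt (excerpt; values are illustrative)
max_batch_size: 8

dynamic_batching {
  # Wait up to 100 microseconds for additional requests before forming a batch
  max_queue_delay_microseconds: 100
}
```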
### Triton Instance Count
The number of parallel inference instances for each Triton model is configured via the `instance_group` section in `config.pbtxt` (default: 1). Increasing the instance count improves parallelism but consumes more device resources.
```
# model_repo/layout-parsing/config.pbtxt
instance_group [
  {
    count: 1  # Number of instances; increase for higher parallelism
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```
There is a trade-off between instance count and dynamic batching:
- **Single instance (`count: 1`)**: Dynamic batching combines multiple requests into one batch for parallel execution, but all requests in the same batch must wait for the slowest one to finish before results are returned, which may increase latency for faster requests. Additionally, a single instance can only process one batch at a time, so subsequent requests must queue until the current batch completes. Best suited for scenarios with limited GPU memory or uniform request processing times.
- **Multiple instances (`count: 2+`)**: Multiple instances can process different batches simultaneously, allowing more requests to be handled concurrently. This reduces queuing time and improves latency for individual requests. Note that within each instance, dynamic batching still applies (requests in the same batch start and finish together). Each additional instance consumes an extra copy of the layout detection model's GPU memory, increases the load on the VLM inference service, and uses more CPU and system memory. Adjust based on the available resources of your inference device.
Non-inference models (e.g., `restructure-pages`) run on CPU and can have their instance count increased based on available CPU cores.
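For example, a CPU-bound model can be given several parallel instances with a `KIND_CPU` instance group. The snippet below is a sketch; the count shown is illustrative and the actual `restructure-pages` configuration may differ.

```
# model_repo/restructure-pages/config.pbtxt (illustrative)
instance_group [
  {
    count: 4   # Scale with available CPU cores (example value)
    kind: KIND_CPU
  }
]
```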
`docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md`
- MLX-VLM: [Refer to this document](./PaddleOCR-VL-Apple-Silicon.en.md)
- llama.cpp:
1. Install llama.cpp by referring to the `Quick start` section in the [llama.cpp github](https://github.com/ggml-org/llama.cpp).
2. Download the model files in GGUF format: [PaddlePaddle/PaddleOCR-VL-1.5-GGUF](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5-GGUF).
3. Execute the following command to start the inference service. For an introduction to the parameters, please refer to [LLaMA.cpp HTTP Server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md):
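A minimal launch could look like the sketch below. The model filename, host, and port are placeholders, not values from this project's documentation; consult the llama.cpp server README linked above for the full flag set.

```bash
# Placeholder GGUF path; use the file downloaded from the repository linked above
llama-server -m ./PaddleOCR-VL-1.5.gguf --host 0.0.0.0 --port 8080
```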
`skills/paddleocr-doc-parsing/references/output_schema.md`
This document defines the output envelope returned by `vl_caller.py`.
By default, `vl_caller.py` saves the JSON envelope to a unique file under the system temp directory and prints the absolute saved path to `stderr`. Use `--output` when you need a custom destination, or `--stdout` when you want to skip file saving and print JSON directly.
## Output Envelope
`vl_caller.py` wraps the provider response in a stable structure:
Raw fields may vary by model version and endpoint.
## Command Examples
```bash
# Parse document from URL (result auto-saves to the system temp directory)
```