| FastAPI Gateway | Unified access point, simplified client calls, concurrency control |
| Triton Server | Layout detection model (PP-DocLayoutV3) and pipeline orchestration; model management, dynamic batching, inference scheduling |
| vLLM Server | VLM (PaddleOCR-VL-1.5), continuous batching inference |
**Triton Models:**
Triton automatically batches requests to improve inference device utilization. The maximum batch size is controlled by the `max_batch_size` parameter in the model configuration file (default: 8), located at `config.pbtxt` under each model directory in the model repository (e.g., `model_repo/layout-parsing/config.pbtxt`).
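As an illustration, the batching-related portion of a `config.pbtxt` might look like the following. This is a sketch, not the shipped configuration: `max_batch_size: 8` matches the stated default, but the `dynamic_batching` settings (such as the queue delay) are example values.

```
# model_repo/layout-parsing/config.pbtxt (excerpt; values are illustrative)
max_batch_size: 8

dynamic_batching {
  # Wait up to 100 microseconds for additional requests before forming a batch
  max_queue_delay_microseconds: 100
}
```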
### Triton Instance Count
The number of parallel inference instances for each Triton model is configured via the `instance_group` section in `config.pbtxt` (default: 1). Increasing the instance count improves parallelism but consumes more device resources.
```
# model_repo/layout-parsing/config.pbtxt
instance_group [
  {
    count: 1  # Number of instances; increase for higher parallelism
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```
There is a trade-off between instance count and dynamic batching:
- **Single instance (`count: 1`)**: Dynamic batching combines multiple requests into one batch for parallel execution, but all requests in the same batch must wait for the slowest one to finish before results are returned, which may increase latency for faster requests. Additionally, a single instance can only process one batch at a time, so subsequent requests must queue until the current batch completes. Best suited for scenarios with limited GPU memory or uniform request processing times.
- **Multiple instances (`count: 2+`)**: Multiple instances can process different batches simultaneously, allowing more requests to be handled concurrently. This reduces queuing time and improves latency for individual requests. Note that within each instance, dynamic batching still applies (requests in the same batch start and finish together). Each additional instance consumes an extra copy of the layout detection model's GPU memory, increases the load on the VLM inference service, and uses more CPU and system memory. Adjust based on the available resources of your inference device.
Non-inference models (e.g., `restructure-pages`) run on CPU and can have their instance count increased based on available CPU cores.
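For example, a CPU-bound model can be given several parallel instances with a `KIND_CPU` instance group. The snippet below is a sketch; the count shown is illustrative and the actual `restructure-pages` configuration may differ.

```
# model_repo/restructure-pages/config.pbtxt (illustrative)
instance_group [
  {
    count: 4   # Scale with available CPU cores (example value)
    kind: KIND_CPU
  }
]
```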
`docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md`
- MLX-VLM: [Refer to this document](./PaddleOCR-VL-Apple-Silicon.en.md)
- llama.cpp:
1. Install llama.cpp by referring to the `Quick start` section in the [llama.cpp github](https://github.com/ggml-org/llama.cpp).
2. Download the model files in GGUF format: [PaddlePaddle/PaddleOCR-VL-1.5-GGUF](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5-GGUF).
3. Execute the following command to start the inference service. For an introduction to the parameters, please refer to [LLaMA.cpp HTTP Server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md):
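A minimal launch could look like the sketch below. The model filename, host, and port are placeholders, not values from this project's documentation; consult the llama.cpp server README linked above for the full flag set.

```bash
# Placeholder GGUF path; use the file downloaded from the repository linked above
llama-server -m ./PaddleOCR-VL-1.5.gguf --host 0.0.0.0 --port 8080
```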
`skills/paddleocr-doc-parsing/references/output_schema.md`
This document defines the output envelope returned by `vl_caller.py`.
By default, `vl_caller.py` saves the JSON envelope to a unique file under the system temp directory and prints the absolute saved path to `stderr`. Use `--output` when you need a custom destination, or `--stdout` when you want to skip file saving and print JSON directly.
## Output Envelope
`vl_caller.py` wraps the provider response in a stable structure:
Raw fields may vary by model version and endpoint.
## Command Examples
```bash
# Parse document from URL (result auto-saves to the system temp directory)
```