Commit b80c17d

Allow more than one HTTP REST & gRPC listener (#3749)

### 🛠 Summary

CVS-170537

- Added support for a comma-separated bind address list via the CLI/C-API.
- Changed `localhost` to `127.0.0.1` in docs where the `requests` pip package is used, because requesting `localhost` can introduce an initial 2 s delay on Windows systems due to an IPv6 connection attempt before the actual IPv4 connection is established.
- Updated the security considerations doc.
- Updated the performance optimization doc.
1 parent 30c6761 commit b80c17d
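The multi-listener syntax introduced by this change can be sketched as below; the model name and path are placeholders for illustration, not taken from this commit:

```shell
# Hypothetical launch line showing the new comma-separated bind address
# lists: the gRPC and REST listeners each bind to both the IPv4 and
# IPv6 loopback addresses.
ovms --model_name my_model --model_path /models/my_model \
     --port 9000 --grpc_bind_address 127.0.0.1,::1 \
     --rest_port 8000 --rest_bind_address 127.0.0.1,::1
```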

File tree

16 files changed

+122
-28
lines changed


demos/age_gender_recognition/python/age_gender_recognition.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 import argparse
 
 parser = argparse.ArgumentParser(description='Client for age gender recognition')
-parser.add_argument('--rest_address', required=False, default='localhost', help='Specify url to REST API service. default:localhost')
+parser.add_argument('--rest_address', required=False, default='127.0.0.1', help='Specify url to REST API service. default:127.0.0.1')
 parser.add_argument('--rest_port', required=False, default=9001, help='Specify port to REST API service. default: 9178')
 parser.add_argument('--model_name', required=False, default='age_gender', help='Model name to request. default: age_gender')
 parser.add_argument('--image_input_path', required=True, help='Input image path.')

demos/continuous_batching/structured_output/README.md

Lines changed: 4 additions & 4 deletions
@@ -120,7 +120,7 @@ payload = {
 }
 
 headers = {"Content-Type": "application/json", "Authorization": "not used"}
-response = requests.post("http://localhost:8000/v3/chat/completions", json=payload, headers=headers)
+response = requests.post("http://127.0.0.1:8000/v3/chat/completions", json=payload, headers=headers)
 json_response = response.json()
 
 print(json_response["choices"][0]["message"]["content"])
@@ -138,7 +138,7 @@ pip install openai
 ```python
 from openai import OpenAI
 from pydantic import BaseModel
-base_url = "http://localhost:8000/v3"
+base_url = "http://127.0.0.1:8000/v3"
 model_name = "OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov"
 client = OpenAI(base_url=base_url, api_key="unused")
 class CalendarEvent(BaseModel):
@@ -174,14 +174,14 @@ It will be executed with the response_format request field including the schema
 ```console
 pip install datasets tqdm openai jsonschema
 curl -L https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/structured_output/accuracy_test.py -O
-python accuracy_test.py --base_url http://localhost:8000/v3 --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --concurrency 50 --limit 1000
+python accuracy_test.py --base_url http://127.0.0.1:8000/v3 --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --concurrency 50 --limit 1000
 ```
 ```
 Requests: 1000, Successful responses: 1000, Exact matches: 135, Schema matches: 435 Invalid inputs: 0
 ```
 
 ```console
-python accuracy_test.py --base_url http://localhost:8000/v3 --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --enable_response_format --concurrency 50 --limit 1000
+python accuracy_test.py --base_url http://127.0.0.1:8000/v3 --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --enable_response_format --concurrency 50 --limit 1000
 ```
 ```
 Requests: 1000, Successful responses: 1000, Exact matches: 217, Schema matches: 828 Invalid inputs: 0

demos/continuous_batching/vlm/README.md

Lines changed: 1 addition & 1 deletion
@@ -239,7 +239,7 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/m
 ```python
 import requests
 import base64
-base_url='http://localhost:8000/v3'
+base_url='http://127.0.0.1:8000/v3'
 model_name = "OpenGVLab/InternVL2-2B"
 
 def convert_image(Image):

demos/rerank/README.md

Lines changed: 6 additions & 6 deletions
@@ -130,7 +130,7 @@ pip3 install cohere
 ```bash
 echo '
 import cohere
-client = cohere.Client(base_url="http://localhost:8000/v3", api_key="not_used")
+client = cohere.Client(base_url="http://127.0.0.1:8000/v3", api_key="not_used")
 responses = client.rerank(query="hello",documents=["welcome","farewell"], model="BAAI/bge-reranker-large")
 for response in responses.results:
     print(f"index {response.index}, relevance_score {response.relevance_score}")' > rerank_client.py
@@ -178,7 +178,7 @@ documents = [
     document_template.format(doc=doc, suffix=suffix) for doc in documents
 ]
 
-response = requests.post("http://localhost:8125/v3/rerank",
+response = requests.post("http://127.0.0.1:8000/v3/rerank",
     json={
         "model": "tomaarsen/Qwen3-Reranker-0.6B-seq-cls",
         "query": query,
@@ -199,7 +199,7 @@ It will return response similar to:
 
 ```bash
 git clone https://github.com/openvinotoolkit/model_server
-python model_server/demos/rerank/compare_results.py --query "hello" --document "welcome" --document "farewell" --base_url http://localhost:8000/v3/
+python model_server/demos/rerank/compare_results.py --query "hello" --document "welcome" --document "farewell" --base_url http://127.0.0.1:8000/v3/
 query hello
 documents ['welcome', 'farewell']
 HF Duration: 145.731 ms
@@ -214,7 +214,7 @@ An asynchronous benchmarking client can be used to access the model server perfo
 ```bash
 cd model_server/demos/benchmark/embeddings/
 pip install -r requirements.txt
-python benchmark_embeddings.py --api_url http://localhost:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
+python benchmark_embeddings.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
 Number of documents: 1000
 100%|██████████████████████████████████████| 50/50 [00:19<00:00, 2.53it/s]
 Tokens: 501000
@@ -224,7 +224,7 @@ Mean latency: 10268 ms
 Median latency: 10249 ms
 Average document length: 501.0 tokens
 
-python benchmark_embeddings.py --api_url http://localhost:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
+python benchmark_embeddings.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
 Number of documents: 1000
 100%|██████████████████████████████████████| 50/50 [00:19<00:00, 2.53it/s]
 Tokens: 501000
@@ -234,7 +234,7 @@ Mean latency: 10268 ms
 Median latency: 10249 ms
 Average document length: 501.0 tokens
 
-python benchmark_embeddings.py --api_url http://localhost:8000/v3/rerank --backend ovms_rerank --dataset Cohere/wikipedia-22-12-simple-embeddings --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
+python benchmark_embeddings.py --api_url http://127.0.0.1:8000/v3/rerank --backend ovms_rerank --dataset Cohere/wikipedia-22-12-simple-embeddings --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
 Number of documents: 1000
 100%|██████████████████████████████████████| 50/50 [00:09<00:00, 5.55it/s]
 Tokens: 92248

demos/vlm_npu/README.md

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js
 ```python
 import requests
 import base64
-base_url='http://localhost:8000/v3'
+base_url='http://127.0.0.1:8000/v3'
 model_name = "microsoft/Phi-3.5-vision-instruct"
 
 def convert_image(Image):

docs/clients_genai.md

Lines changed: 4 additions & 4 deletions
@@ -67,7 +67,7 @@ print(response.choices[0].message)
 import requests
 payload = {"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [ {"role": "user","content": "Say this is a test" }]}
 headers = {"Content-Type": "application/json", "Authorization": "not used"}
-response = requests.post("http://localhost:8000/v3/chat/completions", json=payload, headers=headers)
+response = requests.post("http://127.0.0.1:8000/v3/chat/completions", json=payload, headers=headers)
 print(response.text)
 ```
 :::
@@ -147,7 +147,7 @@ print(response.choices[0].text)
 import requests
 payload = {"model": "meta-llama/Llama-2-7b", "prompt": "Say this is a test"}
 headers = {"Content-Type": "application/json", "Authorization": "not used"}
-response = requests.post("http://localhost:8000/v3/completions", json=payload, headers=headers)
+response = requests.post("http://127.0.0.1:8000/v3/completions", json=payload, headers=headers)
 print(response.text)
 ```
 :::
@@ -280,7 +280,7 @@ for data in responses.data:
 import requests
 payload = {"model": "Alibaba-NLP/gte-large-en-v1.5", "input": "hello world"}
 headers = {"Content-Type": "application/json", "Authorization": "not used"}
-response = requests.post("http://localhost:8000/v3/embeddings", json=payload, headers=headers)
+response = requests.post("http://127.0.0.1:8000/v3/embeddings", json=payload, headers=headers)
 print(response.text)
 ```
 :::
@@ -435,7 +435,7 @@ for res in responses.results:
 import requests
 payload = {"model": "BAAI/bge-reranker-large", "query": "Hello", "documents":["Welcome","Farewell"]}
 headers = {"Content-Type": "application/json", "Authorization": "not used"}
-response = requests.post("http://localhost:8000/v3/rerank", json=payload, headers=headers)
+response = requests.post("http://127.0.0.1:8000/v3/rerank", json=payload, headers=headers)
 print(response.text)
 ```
 :::

docs/parameters.md

Lines changed: 2 additions & 2 deletions
@@ -36,8 +36,8 @@ Configuration options for the server are defined only via command-line options a
 |---|---|---|
 | `port` | `integer` | Number of the port used by gRPC sever. |
 | `rest_port` | `integer` | Number of the port used by HTTP server (if not provided or set to 0, HTTP server will not be launched). |
-| `grpc_bind_address` | `string` | Network interface address or a hostname, to which gRPC server will bind to. Default: all interfaces: 0.0.0.0 |
-| `rest_bind_address` | `string` | Network interface address or a hostname, to which REST server will bind to. Default: all interfaces: 0.0.0.0 |
+| `grpc_bind_address` | `string` | Comma separated list of ipv4/ipv6 network interface addresses or hostnames, to which gRPC server will bind to. Default: all interfaces: 0.0.0.0 |
+| `rest_bind_address` | `string` | Comma separated list of ipv4/ipv6 network interface addresses or hostnames, to which REST server will bind to. Default: all interfaces: 0.0.0.0 |
 | `grpc_workers` | `integer` | Number of the gRPC server instances (must be from 1 to CPU core count). Default value is 1 and it's optimal for most use cases. Consider setting higher value while expecting heavy load. |
 | `rest_workers` | `integer` | Number of HTTP server threads. Effective when `rest_port` > 0. Default value is set based on the number of CPUs. |
 | `file_system_poll_wait_seconds` | `integer` | Time interval between config and model versions changes detection in seconds. Default value is 1. Zero value disables changes monitoring. |
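The new comma-separated semantics of these options can be illustrated with a short sketch; `parse_bind_addresses` is a hypothetical helper for illustration, not OVMS code:

```python
import ipaddress

def parse_bind_addresses(value):
    """Split a comma-separated bind address list (the format accepted by
    --grpc_bind_address / --rest_bind_address) and classify each entry
    as an IPv4 literal, an IPv6 literal, or a hostname."""
    entries = [e.strip() for e in value.split(",") if e.strip()]
    result = []
    for entry in entries:
        try:
            ip = ipaddress.ip_address(entry)
            kind = "ipv4" if ip.version == 4 else "ipv6"
        except ValueError:
            kind = "hostname"  # hostnames are allowed alongside literals
        result.append((entry, kind))
    return result

print(parse_bind_addresses("127.0.0.1,::1"))
# [('127.0.0.1', 'ipv4'), ('::1', 'ipv6')]
```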

docs/performance_tuning.md

Lines changed: 17 additions & 0 deletions
@@ -144,6 +144,23 @@ To save power, the OS can decrease the CPU frequency and increase a volatility o
 $ cpupower frequency-set --min 3.1GHz
 ```
 
+## Network Configuration for Optimal Performance
+
+By default, OVMS endpoints are bound to all IPv4 addresses. On some systems, which resolve the localhost name to an IPv6 address, the client may need extra time to fall back to IPv4. This can effectively add 1-2 s of latency.
+It can be avoided by switching the API URL to `http://127.0.0.1` on the client side.
+
+Alternatively, IPv6 can be enabled in the model server using `--grpc_bind_address` and `--rest_bind_address`.
+For example:
+```
+--grpc_bind_address 127.0.0.1,::1 --rest_bind_address 127.0.0.1,::1
+```
+or
+```
+--grpc_bind_address 0.0.0.0,:: --rest_bind_address 0.0.0.0,::
+```
+
 ## Tuning Model Server configuration parameters
 
 OpenVINO Model Server in C++ implementation is using scalable multithreaded gRPC and REST interface, however in some hardware configuration it might become a bottleneck for high performance backend with OpenVINO.
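The localhost resolution behavior behind the delay described above can be inspected with the Python standard library; this is an illustration, not OVMS code:

```python
import socket

def resolved_families(host, port=8000):
    """Return the address families getaddrinfo yields for a host, in
    order. Clients typically try the first entry first, so an IPv6
    entry ahead of IPv4 is what triggers the fallback delay."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [info[0] for info in infos]

# On systems where localhost resolves to ::1 first, the first list
# starts with AF_INET6; the literal address yields AF_INET only.
print(resolved_families("localhost"))
print(resolved_families("127.0.0.1"))
```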

docs/security_considerations.md

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@ docker run --rm -d --user $(id -u):$(id -g) --read-only --tmpfs /tmp -p 9000:900
 ---
 OpenVINO Model Server currently does not provide access restrictions and traffic encryption on gRPC and REST API endpoints. The endpoints can be secured using network settings like docker network settings or network firewall on the host. The recommended configuration is to place OpenVINO Model Server behind any reverse proxy component or load balancer, which provides traffic encryption and user authorization.
 
+When deploying in environments where only local access is required, administrators can configure the server to bind exclusively to localhost addresses. This can be achieved by setting the bind address to `127.0.0.1` for IPv4 or `::1` for IPv6, which restricts incoming connections to the local machine only. This configuration prevents external network access to the server endpoints, providing an additional layer of security for local development or testing environments.
+```
+--grpc_bind_address 127.0.0.1,::1 --rest_bind_address 127.0.0.1,::1
+```
+
 See also:
 - [Securing OVMS with NGINX](../extras/nginx-mtls-auth/README.md)
 - [Securing models with OVSA](https://docs.openvino.ai/2025/about-openvino/openvino-ecosystem/openvino-project/openvino-security-add-on.html)
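The effect of loopback-only binding can be illustrated with a plain socket; this is a sketch of the concept, not of the model server's implementation:

```python
import socket

def open_loopback_listener():
    """Bind a TCP listener to the IPv4 loopback only, mirroring what
    --rest_bind_address 127.0.0.1 achieves: the OS refuses connections
    arriving on any other interface, so only local clients can reach it."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen()
    return srv

listener = open_loopback_listener()
host, port = listener.getsockname()
print(f"listening only on {host}:{port}")
listener.close()
```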

docs/stateful_models.md

Lines changed: 1 addition & 1 deletion
@@ -259,7 +259,7 @@ signature = "serving_default"
 request_body = json.dumps({"signature_name": signature,'inputs': inputs})
 
 # Send request to OVMS and get response
-response = requests.post("localhost:5555/v1/models/stateful_model:predict", data=request_body)
+response = requests.post("127.0.0.1:5555/v1/models/stateful_model:predict", data=request_body)
 
 # Parse response
 response_body = json.loads(response.text)
