
Commit c4b1994

Move & rename replica parallelism fields (#1158)
Co-authored-by: David Eliahu <[email protected]>
1 parent d210064 commit c4b1994

25 files changed (+132, -138 lines)


cli/local/docker_spec.go

Lines changed: 5 additions & 3 deletions
@@ -19,6 +19,7 @@ package local
 import (
 	"context"
 	"fmt"
+	"math"
 	"path/filepath"
 	"strings"
 
@@ -88,9 +89,10 @@ func getAPIEnv(api *spec.API, awsClient *aws.Client) []string {
 		"CORTEX_MODELS="+strings.Join(api.ModelNames(), ","),
 		"CORTEX_API_SPEC="+filepath.Join("/mnt/workspace", filepath.Base(api.Key)),
 		"CORTEX_PROJECT_DIR="+_projectDir,
-		"CORTEX_WORKERS_PER_REPLICA=1",
-		"CORTEX_THREADS_PER_WORKER=1",
-		"CORTEX_MAX_WORKER_CONCURRENCY=1000",
+		"CORTEX_PROCESSES_PER_REPLICA="+s.Int32(api.Predictor.ProcessesPerReplica),
+		"CORTEX_THREADS_PER_PROCESS="+s.Int32(api.Predictor.ThreadsPerProcess),
+		// add 1 because it was required to achieve the target concurrency for 1 process, 1 thread
+		"CORTEX_MAX_PROCESS_CONCURRENCY="+s.Int64(1+int64(math.Round(float64(consts.DefaultMaxReplicaConcurrency)/float64(api.Predictor.ProcessesPerReplica)))),
 		"CORTEX_SO_MAX_CONN=1000",
 		"AWS_REGION="+awsClient.Region,
 	)
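The local CLI no longer hard-codes these env vars: `CORTEX_MAX_PROCESS_CONCURRENCY` is now derived by splitting the replica-wide concurrency budget across processes, plus 1 (per the in-code comment). A rough Python sketch of that arithmetic follows; the 1024 budget assumes `consts.DefaultMaxReplicaConcurrency` equals the documented `max_replica_concurrency` default, and the function name is illustrative, not part of the codebase:

```python
def max_process_concurrency(processes_per_replica: int,
                            default_max_replica_concurrency: int = 1024) -> int:
    # Split the replica-wide concurrency budget evenly across processes,
    # then add 1 (required to achieve the target concurrency for 1 process, 1 thread).
    # Note: Go's math.Round rounds halves away from zero, while Python's round()
    # uses banker's rounding; the difference only matters for exact .5 quotients.
    return 1 + round(default_max_replica_concurrency / processes_per_replica)

# e.g. 1 process -> 1 + 1024 = 1025; 4 processes -> 1 + round(1024 / 4) = 257
print(max_process_concurrency(1), max_process_concurrency(4))
```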

docs/deployments/api-configuration.md

Lines changed: 12 additions & 12 deletions
@@ -13,6 +13,8 @@ Reference the section below which corresponds to your Predictor type: [Python](#
   predictor:
     type: python
     path: <string> # path to a python file with a PythonPredictor class definition, relative to the Cortex root (required)
+    processes_per_replica: <int> # the number of parallel serving processes to run on each replica (default: 1)
+    threads_per_process: <int> # the number of threads per process (default: 1)
     config: <string: value> # arbitrary dictionary passed to the constructor of the Predictor (optional)
     python_path: <string> # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
     image: <string> # docker image to use for the Predictor (default: cortexlabs/python-predictor-cpu or cortexlabs/python-predictor-gpu based on compute)
@@ -33,9 +35,7 @@ Reference the section below which corresponds to your Predictor type: [Python](#
     min_replicas: <int> # minimum number of replicas (default: 1)
     max_replicas: <int> # maximum number of replicas (default: 100)
     init_replicas: <int> # initial number of replicas (default: <min_replicas>)
-    workers_per_replica: <int> # the number of parallel serving workers to run on each replica (default: 1)
-    threads_per_worker: <int> # the number of threads per worker (default: 1)
-    target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
+    target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: processes_per_replica * threads_per_process)
     max_replica_concurrency: <int> # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
     window: <duration> # the time over which to average the API's concurrency (default: 60s)
     downscale_stabilization_period: <duration> # the API will not scale below the highest recommendation made during this period (default: 5m)
@@ -49,7 +49,7 @@ Reference the section below which corresponds to your Predictor type: [Python](#
     max_unavailable: <string | int> # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
 ```
 
-See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), [networking](networking.md), [prediction monitoring](prediction-monitoring.md), and [overriding API images](system-packages.md).
+See additional documentation for [parallelism](parallelism.md), [autoscaling](autoscaling.md), [compute](compute.md), [networking](networking.md), [prediction monitoring](prediction-monitoring.md), and [overriding API images](system-packages.md).
 
 ## TensorFlow Predictor
 
@@ -65,6 +65,8 @@ See additional documentation for [autoscaling](autoscaling.md), [compute](comput
     model_path: <string> # S3 path to an exported model (e.g. s3://my-bucket/exported_model) (required)
     signature_key: <string> # name of the signature def to use for prediction (required if your model has more than one signature def)
     ...
+    processes_per_replica: <int> # the number of parallel serving processes to run on each replica (default: 1)
+    threads_per_process: <int> # the number of threads per process (default: 1)
     config: <string: value> # arbitrary dictionary passed to the constructor of the Predictor (optional)
     python_path: <string> # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
     image: <string> # docker image to use for the Predictor (default: cortexlabs/tensorflow-predictor)
@@ -86,9 +88,7 @@ See additional documentation for [autoscaling](autoscaling.md), [compute](comput
     min_replicas: <int> # minimum number of replicas (default: 1)
     max_replicas: <int> # maximum number of replicas (default: 100)
     init_replicas: <int> # initial number of replicas (default: <min_replicas>)
-    workers_per_replica: <int> # the number of parallel serving workers to run on each replica (default: 1)
-    threads_per_worker: <int> # the number of threads per worker (default: 1)
-    target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
+    target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: processes_per_replica * threads_per_process)
     max_replica_concurrency: <int> # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
     window: <duration> # the time over which to average the API's concurrency (default: 60s)
     downscale_stabilization_period: <duration> # the API will not scale below the highest recommendation made during this period (default: 5m)
@@ -102,7 +102,7 @@ See additional documentation for [autoscaling](autoscaling.md), [compute](comput
     max_unavailable: <string | int> # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
 ```
 
-See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), [networking](networking.md), [prediction monitoring](prediction-monitoring.md), and [overriding API images](system-packages.md).
+See additional documentation for [parallelism](parallelism.md), [autoscaling](autoscaling.md), [compute](compute.md), [networking](networking.md), [prediction monitoring](prediction-monitoring.md), and [overriding API images](system-packages.md).
 
 ## ONNX Predictor
 
@@ -117,6 +117,8 @@ See additional documentation for [autoscaling](autoscaling.md), [compute](comput
     model_path: <string> # S3 path to an exported model (e.g. s3://my-bucket/exported_model.onnx) (required)
     signature_key: <string> # name of the signature def to use for prediction (required if your model has more than one signature def)
     ...
+    processes_per_replica: <int> # the number of parallel serving processes to run on each replica (default: 1)
+    threads_per_process: <int> # the number of threads per process (default: 1)
     config: <string: value> # arbitrary dictionary passed to the constructor of the Predictor (optional)
     python_path: <string> # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
     image: <string> # docker image to use for the Predictor (default: cortexlabs/onnx-predictor-gpu or cortexlabs/onnx-predictor-cpu based on compute)
@@ -136,9 +138,7 @@ See additional documentation for [autoscaling](autoscaling.md), [compute](comput
     min_replicas: <int> # minimum number of replicas (default: 1)
     max_replicas: <int> # maximum number of replicas (default: 100)
     init_replicas: <int> # initial number of replicas (default: <min_replicas>)
-    workers_per_replica: <int> # the number of parallel serving workers to run on each replica (default: 1)
-    threads_per_worker: <int> # the number of threads per worker (default: 1)
-    target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: workers_per_replica * threads_per_worker)
+    target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: processes_per_replica * threads_per_process)
     max_replica_concurrency: <int> # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
     window: <duration> # the time over which to average the API's concurrency (default: 60s)
     downscale_stabilization_period: <duration> # the API will not scale below the highest recommendation made during this period (default: 5m)
@@ -152,4 +152,4 @@ See additional documentation for [autoscaling](autoscaling.md), [compute](comput
     max_unavailable: <string | int> # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
 ```
 
-See additional documentation for [autoscaling](autoscaling.md), [compute](compute.md), [networking](networking.md), [prediction monitoring](prediction-monitoring.md), and [overriding API images](system-packages.md).
+See additional documentation for [parallelism](parallelism.md), [autoscaling](autoscaling.md), [compute](compute.md), [networking](networking.md), [prediction monitoring](prediction-monitoring.md), and [overriding API images](system-packages.md).
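The renamed fields keep their old semantics: `processes_per_replica * threads_per_process` is how many requests a single replica can work on in parallel, and it is also the default for `target_replica_concurrency`. A quick illustrative calculation (the values below are arbitrary examples, not defaults):

```python
processes_per_replica = 2  # example value for the renamed field
threads_per_process = 4    # example value for the renamed field

# Requests one replica can process at the same time
parallel_capacity = processes_per_replica * threads_per_process  # 8

# Default autoscaling target when target_replica_concurrency is not set
target_replica_concurrency = float(parallel_capacity)  # 8.0
print(parallel_capacity, target_replica_concurrency)
```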

docs/deployments/autoscaling.md

Lines changed: 4 additions & 12 deletions
@@ -4,33 +4,25 @@ _WARNING: you are on the master branch, please refer to the docs on the branch t
 
 Cortex autoscales your web services on a per-API basis based on your configuration.
 
-## Replica Parallelism
-
-* `workers_per_replica` (default: 1): Each replica runs a web server with `workers_per_replica` workers, each of which runs in it's own process. For APIs running with multiple CPUs per replica, using 1-3 workers per unit of CPU generally leads to optimal throughput. For example, if `cpu` is 2, a value between 2 and 6 `workers_per_replica` is reasonable. The optimal number will vary based on the workload and the CPU request for the API.
-
-* `threads_per_worker` (default: 1): Each worker uses a thread pool of size `threads_per_worker` to process requests. For applications that are not CPU intensive such as high I/O (e.g. downloading files), GPU-based inference or Inferentia ASIC-based inference, increasing the number of threads per worker can increase throughput. For CPU-bound applications such as running your model inference on a CPU, using 1 thread per worker is recommended to avoid unnecessary context switching. Some applications are not thread-safe, and therefore must be run with 1 thread per worker.
-
-`workers_per_replica` * `threads_per_worker` represents the number of requests that your replica can work in parallel. For example, if `workers_per_replica` is 2 and `threads_per_worker` is 2, and the replica was hit with 5 concurrent requests, 4 would immediately begin to be processed, 1 would be waiting for a thread to become available, and the concurrency for the replica would be 5. If the replica was hit with 3 concurrent requests, all three would begin processing immediately, and the replica concurrency would be 3.
-
 ## Autoscaling Replicas
 
 * `min_replicas`: The lower bound on how many replicas can be running for an API.
 
 * `max_replicas`: The upper bound on how many replicas can be running for an API.
 
-* `target_replica_concurrency` (default: `workers_per_replica` * `threads_per_worker`): This is the desired number of in-flight requests per replica, and is the metric which the autoscaler uses to make scaling decisions.
+* `target_replica_concurrency` (default: `processes_per_replica` * `threads_per_process`): This is the desired number of in-flight requests per replica, and is the metric which the autoscaler uses to make scaling decisions.
 
 Replica concurrency is simply how many requests have been sent to a replica and have not yet been responded to (also referred to as in-flight requests). Therefore, it includes requests which are currently being processed and requests which are waiting in the replica's queue.
 
 The autoscaler uses this formula to determine the number of desired replicas:
 
 `desired replicas = sum(in-flight requests across all replicas) / target_replica_concurrency`
 
-For example, setting `target_replica_concurrency` to `workers_per_replica` * `threads_per_worker` (the default) causes the cluster to adjust the number of replicas so that on average, requests are immediately processed without waiting in a queue, and workers/threads are never idle.
+For example, setting `target_replica_concurrency` to `processes_per_replica` * `threads_per_process` (the default) causes the cluster to adjust the number of replicas so that on average, requests are immediately processed without waiting in a queue, and processes/threads are never idle.
 
-* `max_replica_concurrency` (default: 1024): This is the maximum number of in-flight requests per replica before requests are rejected with HTTP error code 503. `max_replica_concurrency` includes requests that are currently being processed as well as requests that are waiting in the replica's queue (a replica can actively process `workers_per_replica` * `threads_per_worker` requests concurrently, and will hold any additional requests in a local queue). Decreasing `max_replica_concurrency` and configuring the client to retry when it receives 503 responses will improve queue fairness by preventing requests from sitting in long queues.
+* `max_replica_concurrency` (default: 1024): This is the maximum number of in-flight requests per replica before requests are rejected with HTTP error code 503. `max_replica_concurrency` includes requests that are currently being processed as well as requests that are waiting in the replica's queue (a replica can actively process `processes_per_replica` * `threads_per_process` requests concurrently, and will hold any additional requests in a local queue). Decreasing `max_replica_concurrency` and configuring the client to retry when it receives 503 responses will improve queue fairness by preventing requests from sitting in long queues.
 
-*Note (if `workers_per_replica` > 1): In reality, there is a queue per worker; for most purposes thinking of it as a per-replica queue will be sufficient, although in some cases the distinction is relevant. Because requests are randomly assigned to workers within a replica (which leads to unbalanced worker queues), clients may receive 503 responses before reaching `max_replica_concurrency`. For example, if you set `workers_per_replica: 2` and `max_replica_concurrency: 100`, each worker will be allowed to handle 50 requests concurrently. If your replica receives 90 requests that take the same amount of time to process, there is a 24.6% possibility that more than 50 requests are routed to 1 worker, and each request that is routed to that worker above 50 is responded to with a 503. To address this, it is recommended to implement client retries for 503 errors, or to increase `max_replica_concurrency` to minimize the probability of getting 503 responses.*
+*Note (if `processes_per_replica` > 1): In reality, there is a queue per process; for most purposes thinking of it as a per-replica queue will be sufficient, although in some cases the distinction is relevant. Because requests are randomly assigned to processes within a replica (which leads to unbalanced process queues), clients may receive 503 responses before reaching `max_replica_concurrency`. For example, if you set `processes_per_replica: 2` and `max_replica_concurrency: 100`, each process will be allowed to handle 50 requests concurrently. If your replica receives 90 requests that take the same amount of time to process, there is a 24.6% possibility that more than 50 requests are routed to 1 process, and each request that is routed to that process above 50 is responded to with a 503. To address this, it is recommended to implement client retries for 503 errors, or to increase `max_replica_concurrency` to minimize the probability of getting 503 responses.*
 
 * `window` (default: 60s): The time over which to average the API wide in-flight requests (which is the sum of in-flight requests in each replica). The longer the window, the slower the autoscaler will react to changes in API wide in-flight requests, since it is averaged over the `window`. API wide in-flight requests is calculated every 10 seconds, so `window` must be a multiple of 10 seconds.
 
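Both the scaling formula and the note's 24.6% figure can be sanity-checked with a little arithmetic. A minimal Python sketch, assuming the ratio is rounded up to a whole number of replicas (the doc only states the ratio; rounding, the averaging `window`, stabilization periods, and the min/max bounds are handled separately by the autoscaler):

```python
import math

def desired_replicas(in_flight_per_replica, target_replica_concurrency):
    # desired replicas = sum(in-flight requests across all replicas) / target_replica_concurrency
    return math.ceil(sum(in_flight_per_replica) / target_replica_concurrency)

# 3 replicas with 7, 5, and 12 in-flight requests and a target of 4 -> 24 / 4 = 6 replicas
print(desired_replicas([7, 5, 12], target_replica_concurrency=4))

# The note's 24.6%: with processes_per_replica: 2 and max_replica_concurrency: 100,
# probability that more than 50 of 90 uniformly-routed requests land on one process.
n, per_process_cap = 90, 50
p_503 = 2 * sum(math.comb(n, k) for k in range(n - per_process_cap)) / 2 ** n
print(f"{p_503:.1%}")  # ~24.6%
```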
docs/deployments/gpus.md

Lines changed: 2 additions & 2 deletions
@@ -11,9 +11,9 @@ To use GPUs:
 
 ## Tips
 
-### If using `workers_per_replica` > 1, TensorFlow-based models, and Python Predictor
+### If using `processes_per_replica` > 1, TensorFlow-based models, and Python Predictor
 
-When using `workers_per_replica` > 1 with TensorFlow-based models (including Keras) in the Python Predictor, loading the model in separate processes at the same time will throw a `CUDA_ERROR_OUT_OF_MEMORY: out of memory` error. This is because the first process that loads the model will allocate all of the GPU's memory and leave none to other processes. To prevent this from happening, the per-process GPU memory usage can be limited. There are two methods:
+When using `processes_per_replica` > 1 with TensorFlow-based models (including Keras) in the Python Predictor, loading the model in separate processes at the same time will throw a `CUDA_ERROR_OUT_OF_MEMORY: out of memory` error. This is because the first process that loads the model will allocate all of the GPU's memory and leave none to other processes. To prevent this from happening, the per-process GPU memory usage can be limited. There are two methods:
 
 1\) Configure the model to allocate only as much memory as it requires, via [tf.config.experimental.set_memory_growth()](https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth):
 
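The hunk ends just before the doc's code sample for method 1. As a rough sketch of what enabling per-process GPU memory growth looks like in TensorFlow 2.x (not necessarily the exact snippet in gpus.md):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front,
# so multiple serving processes can share the same GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```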