diff --git a/model-group-deployment/heterogeneous_model_group_deployment/heterogenous_model_group_deployment.md b/model-group-deployment/heterogeneous_model_group_deployment/heterogenous_model_group_deployment.md
new file mode 100644
index 00000000..5f5f3819
--- /dev/null
+++ b/model-group-deployment/heterogeneous_model_group_deployment/heterogenous_model_group_deployment.md
@@ -0,0 +1,159 @@
+# Heterogeneous Model Group Deployment
+
+## Description
+
+A **Heterogeneous Model Group** comprises models built on different ML frameworks, such as PyTorch, TensorFlow, and ONNX. This group type allows diverse model architectures to be deployed within a single serving environment.
+
+> ℹ️ Heterogeneous model groups **do not** require a shared model group artifact, as models in the group may rely on different runtimes.
+
+## Use Case
+
+Ideal for scenarios requiring multiple models with different architectures or frameworks deployed together under a unified endpoint.
+
+## Supported Containers
+
+- **BYOC (Bring Your Own Container)** that satisfies the **BYOC Contract** requirements.
+- Customers are encouraged to use the **NVIDIA Triton Inference Server**, which provides built-in support for diverse frameworks.
+
+## Serving Mechanism
+
+- Customers should use the **BYOC** deployment flow.
+- The **NVIDIA Triton Inference Server** is recommended for hosting models built with PyTorch, TensorFlow, ONNX Runtime, custom Python, etc.
+- Each model is routed to its corresponding backend automatically.
+- **Triton** handles load balancing, routing, and execution optimization across model types.
+
+For details on dependency management, refer to the section [Dependency Management for Heterogeneous Model Group](#dependency-management-for-heterogeneous-model-group).
+
+## Heterogeneous Model Group Structure
+
+```json
+{
+  "modelGroupsDetails": {
+    "modelGroupConfigurationDetails": {
+      "modelGroupType": "HETEROGENEOUS"
+    },
+    "modelIds": [
+      {
+        "inferenceKey": "model1",
+        "modelId": "ocid.datasciencemodel.xxx1"
+      },
+      {
+        "inferenceKey": "model2",
+        "modelId": "ocid.datasciencemodel.xxx2"
+      },
+      {
+        "inferenceKey": "model3",
+        "modelId": "ocid.datasciencemodel.xxx3"
+      }
+    ]
+  }
+}
+```
+
+> **Note:**
+> For **BYOC**, Model Deployment enforces a **contract** that containers must follow:
+> - Must expose a web server.
+> - Must include all runtime dependencies needed to load and run the ML model binaries.
+
+## Dependency Management for Heterogeneous Model Group
+
+> **Note:** This section is applicable only when using the **NVIDIA Triton Inference Server** for heterogeneous deployments.
+
+### Overview
+
+Triton supports multiple ML frameworks and serves them through corresponding backends.
+
+Triton loads models from one or more **model repositories**, each containing framework-specific models and configuration files.
+
+### Natively Supported Backends
+
+For natively supported backends (e.g., ONNX Runtime, TensorFlow, PyTorch), models must be organized according to the **Triton model repository format**.
+
+#### Sample ONNX Model Directory Structure
+
+```
+model_repository/
+└── onnx_model/
+    ├── 1/
+    │   └── model.onnx
+    └── config.pbtxt
+```
+
+#### Sample `config.pbtxt`
+
+```text
+name: "onnx_model"
+platform: "onnxruntime_onnx"
+input [
+  {
+    name: "input_tensor"
+    data_type: TYPE_FP32
+    dims: [ -1, 3, 224, 224 ]
+  }
+]
+output [
+  {
+    name: "output_tensor"
+    data_type: TYPE_FP32
+    dims: [ -1, 1000 ]
+  }
+]
+```
+
+✅ No dependency conflicts are expected for natively supported models.
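+
+#### Sample Inference Request (Illustrative)
+
+As an illustration of how the `config.pbtxt` entries are used at inference time, the sketch below builds a request in Triton's KServe v2 HTTP format for the sample `onnx_model`. The `predict_url` value and the assumption that the BYOC deployment's predict endpoint forwards the body to Triton's `/v2/models/onnx_model/infer` route are placeholders for your own setup, not part of the BYOC contract; authentication is omitted for brevity.
+
+```python
+import json
+
+import requests
+
+# Placeholder: replace with your model deployment's predict endpoint.
+predict_url = "https://<model-deployment-endpoint>/predict"
+
+# One FP32 image-shaped input; names, datatypes, and dims must match config.pbtxt.
+infer_request = {
+    "inputs": [
+        {
+            "name": "input_tensor",
+            "shape": [1, 3, 224, 224],       # -1 in config.pbtxt allows any batch size
+            "datatype": "FP32",
+            "data": [0.0] * (3 * 224 * 224)  # flattened dummy pixel values
+        }
+    ],
+    "outputs": [{"name": "output_tensor"}]
+}
+
+response = requests.post(
+    predict_url,
+    headers={"Content-Type": "application/json"},
+    data=json.dumps(infer_request),
+)
+print(response.json())  # expect an "outputs" entry with 1000 scores per batch item
+```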
+
+### Using the Python Backend
+
+For models that are not supported natively, Triton provides a **Python backend**.
+
+#### Python Model Directory Structure
+
+```
+models/
+└── add_sub/
+    ├── 1/
+    │   └── model.py
+    └── config.pbtxt
+```
+
+#### If the Python Version Differs (Custom Stub)
+
+If the model requires a Python version different from the one bundled with the Triton Python backend, compile a **custom Python backend stub** and place it in the model directory.
+
+```
+models/
+└── model_a/
+    ├── 1/
+    │   └── model.py
+    ├── config.pbtxt
+    └── triton_python_backend_stub
+```
+
+### Models with Custom Execution Environments
+
+Use **conda-pack** to bundle all Python dependencies and isolate them per model.
+
+#### Sample Structure with Conda-Pack
+
+```
+models/
+└── model_a/
+    ├── 1/
+    │   └── model.py
+    ├── config.pbtxt
+    ├── env/
+    │   └── model_a_env.tar.gz
+    └── triton_python_backend_stub
+```
+
+#### Add This to `config.pbtxt` for a Custom Environment
+
+```text
+name: "model_a"
+backend: "python"
+
+parameters: {
+  key: "EXECUTION_ENV_PATH",
+  value: {string_value: "$$TRITON_MODEL_DIRECTORY/env/model_a_env.tar.gz"}
+}
+```
diff --git a/model-group-deployment/llm_stacked_inferencing/llm-stacked-inferencing.md b/model-group-deployment/llm_stacked_inferencing/llm-stacked-inferencing.md
new file mode 100644
index 00000000..a8ca5f95
--- /dev/null
+++ b/model-group-deployment/llm_stacked_inferencing/llm-stacked-inferencing.md
@@ -0,0 +1,261 @@
+# LLM Stacked Inferencing
+
+Ideal for scenarios involving large language models, this capability allows multiple sets of fine-tuned weights to be packaged and deployed together with the base model. The selected fine-tuned weights are applied at run time, maximizing GPU utilization and inference efficiency. This capability also supports A/B experimentation across a collection of fine-tuned weights.
+
+To deploy this configuration, the model group feature is used to create a logical grouping of the base and fine-tuned models, and the model group is then deployed as a stacked inference model deployment.
+
+The deployment type is as follows:
+
+**STACKED**: This group is specifically designed for large language models (LLMs) with a base model and multiple fine-tuned weights.
+
+> **Note**
+> Stacked Inferencing currently supports only **vLLM containers**.
+> To learn more about the features used during stacked inferencing, refer to the official vLLM user guide: https://docs.vllm.ai/en/stable/features/lora.html#dynamically-serving-lora-adapters.
+
+## Deployment Steps
+
+### 1. Create a Stacked Model Group
+
+Use the model group feature to logically group the base and fine-tuned models.
+
+### 2. Deployment Configuration
+
+Update the deployment configuration with the stacked model group ID. Include the following in the `environmentVariables` for the vLLM container:
+
+```json
+{
+  "PARAMS": "--served-model-name --max-model-len 2048 --enable_lora",
+  "VLLM_ALLOW_RUNTIME_LORA_UPDATING": "True"
+}
+```
+
+> 📌 These environment variables are subject to change with vLLM versions.
+
+**LoRA Enablement:**
+The `--enable-lora` flag (passed as `--enable_lora` in `PARAMS` above) is required for vLLM to recognize LoRA adapters.
+See [vLLM Docs on LoRA](https://docs.vllm.ai/en/stable/features/lora.html#dynamically-serving-lora-adapters).
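+
+For reference, the sketch below shows the underlying vLLM capability that `VLLM_ALLOW_RUNTIME_LORA_UPDATING` switches on: registering and querying a LoRA adapter at run time through vLLM's `/v1/load_lora_adapter`, `/v1/completions`, and `/v1/unload_lora_adapter` routes, as described in the linked vLLM guide. The standalone server URL, adapter name, and adapter path are hypothetical; in a stacked model group deployment the fine-tuned weights packaged in the model group are applied for you at run time, so this is illustrative only.
+
+```python
+import requests
+
+# Hypothetical standalone vLLM server started with --enable-lora and
+# VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; not the managed deployment endpoint.
+vllm_url = "http://localhost:8000"
+
+# Register a LoRA adapter at run time.
+requests.post(
+    f"{vllm_url}/v1/load_lora_adapter",
+    json={"lora_name": "sql-lora", "lora_path": "/path/to/sql-lora-adapter"},
+)
+
+# Completions can then target the adapter by name through the "model" field.
+resp = requests.post(
+    f"{vllm_url}/v1/completions",
+    json={"model": "sql-lora", "prompt": "SELECT", "max_tokens": 16},
+)
+print(resp.json())
+
+# Unregister the adapter when it is no longer needed.
+requests.post(f"{vllm_url}/v1/unload_lora_adapter", json={"lora_name": "sql-lora"})
+```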
+
+---
+
+## Example Stacked Deployment Payload
+
+```python
+# Example payload for a stacked deployment
+compartment_id = "compartmentID"
+project_id = "projectID"
+model_group_id = "StackedModelGroupId"
+
+payload = {
+    "displayName": "MMS Model Group Deployment - Stacked",
+    "description": "mms",
+    "compartmentId": compartment_id,
+    "projectId": project_id,
+    "modelDeploymentConfigurationDetails": {
+        "deploymentType": "MODEL_GROUP",
+        "modelGroupConfigurationDetails": {
+            "modelGroupId": model_group_id
+        },
+        "infrastructureConfigurationDetails": {
+            "infrastructureType": "INSTANCE_POOL",
+            "instanceConfiguration": {
+                "instanceShapeName": "VM.GPU.A10.1"
+            },
+            "scalingPolicy": {
+                "policyType": "FIXED_SIZE",
+                "instanceCount": 1
+            }
+        },
+        "environmentConfigurationDetails": {
+            "environmentConfigurationType": "OCIR_CONTAINER",
+            "serverPort": 8080,
+            "image": "iad.ocir.io/ociodscdev/dsmc/inferencing/odsc-vllm-serving:0.6.4.post1.1",
+            "environmentVariables": {
+                "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/completions",
+                "PARAMS": "--served-model-name --max-lora-rank 64 --max-model-len 4096 --enable_lora",
+                "PORT": "8080",
+                "VLLM_ALLOW_RUNTIME_LORA_UPDATING": "true",
+                "TENSOR_PARALLELISM": "1"
+            }
+        }
+    },
+    "categoryLogDetails": {
+        "access": {
+            "logGroupId": "",
+            "logId": ""
+        },
+        "predict": {
+            "logGroupId": "",
+            "logId": ""
+        }
+    }
+}
+```
+
+---
+
+## Predict Call
+
+> 📌 To use an inference key, pass its value in the `"model"` field in place of the model OCID in the example below.
+
+```python
+import json
+
+import requests
+
+# `endpoint`, `md_ocid`, `auth`, and `util` (request helpers) are assumed to be defined.
+# `base_model_ocid`, `ft_model_ocid1`, and `ft_model_ocid2` are the OCIDs of the base
+# model and the fine-tuned weight models in the stacked model group.
+
+def predict(model_id_or_inference_key):
+    predict_url = f'{endpoint}/{md_ocid}/predict'
+    predict_data = json.dumps({
+        "model": model_id_or_inference_key,
+        "prompt": "[user] Write a SQL query to answer the question based on the table schema.\\n\\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\\n\\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
+        "max_tokens": 100,
+        "temperature": 0
+    })
+    predict_headers = {
+        'Content-Type': 'application/json',
+        'opc-request-id': 'test-id'
+    }
+
+    response = requests.request("POST", predict_url, headers=predict_headers, data=predict_data, auth=auth, verify=False)
+    util.print_response(response)
+
+
+if __name__ == "__main__":
+    baseModel = f"{base_model_ocid}"
+    print("BaseModel", baseModel)
+    predict(baseModel)
+
+    lora1 = f'{ft_model_ocid1}'
+    print("FT-Weight1", lora1)
+    predict(lora1)
+
+    lora2 = f'{ft_model_ocid2}'
+    print("FT-Weight2", lora2)
+    predict(lora2)
+```
+
+---
+
+## Updating Fine-Tuned Weights
+
+- Create a new **stacked model group** using either the `CREATE` or `CLONE` operation.
+- In the new stacked model group, you can **add or remove fine-tuned weight models** compared to the original group.
+
+### Sample Create Payload with Updated Weights
+
+```python
+import json
+
+import requests
+
+# `endpoint`, `auth`, and `util` are assumed to be defined as in the earlier examples;
+# `base_model_ocid`, `updated_model_ocid_1`, and `updated_model_ocid_2` are model OCIDs.
+compartment_id = "compartmentID"
+project_id = "projectID"
+mg_create_url = f"{endpoint}/modelGroups?compartmentId={compartment_id}"
+
+mg_payload = json.dumps({
+    "createType": "CREATE",
+    "compartmentId": compartment_id,
+    "projectId": project_id,
+    "displayName": "Model Group - Stacked",
+    "description": "Test stacked model group",
+    "modelGroupDetails": {
+        "type": "STACKED",
+        "baseModelId": f'{base_model_ocid}'
+    },
+    "memberModelEntries": {
+        "memberModelDetails": [
+            {
+                "inferenceKey": "basemodel",
+                "modelId": f'{base_model_ocid}'
+            },
+            {
+                "inferenceKey": "sql-lora-2",
+                "modelId": f'{updated_model_ocid_1}'
+            },
+            {
+                "inferenceKey": "sql-lora-3",
+                "modelId": f'{updated_model_ocid_2}'
+            }
+        ]
+    }
+})
+
+# Submit the create request (mirrors the update call shown below).
+response = requests.request("POST", mg_create_url, headers=util.headers, data=mg_payload, auth=auth)
+```
+
+---
+
+## Live Update
+
+Perform an update of the model deployment with the updated stacked model group ID.
+
+```python
+import json
+
+import requests
+
+# `auth` and `util` are assumed to be defined as in the earlier examples.
+compartment_id = "compartmentID"
+project_id = "projectID"
+md_ocid = "modelDeploymentOcid"  # OCID of the existing model deployment
+endpoint = 'https://modeldeployment-int.us-ashburn-1.oci.oc-test.com'
+
+update_url = f"{endpoint}/modelDeployments/{md_ocid}"
+model_group_id_update = "update_modelGroupId"
+update_payload = json.dumps({
+    "displayName": "MMS Model Group Deployment - Stacked",
+    "description": "mms",
+    "compartmentId": compartment_id,
+    "projectId": project_id,
+    "modelDeploymentConfigurationDetails": {
+        "deploymentType": "MODEL_GROUP",
+        "updateType": "LIVE",
+        "modelGroupConfigurationDetails": {
+            "modelGroupId": model_group_id_update
+        }
+    }
+})
+response = requests.request("PUT", update_url, headers=util.headers, data=update_payload, auth=auth)
+```
+
+---
+
+## Perform a /predict Call on the Model Deployment
+
+> 📌 To use an inference key, pass its value in the `"model"` field in place of the model OCID in the example below.
+
+```python
+import json
+
+import requests
+
+# `auth`, `util`, `base_model_ocid`, and the fine-tuned weight identifiers
+# (`lora_weight_1`, `lora_weight_2`, `lora_weight_3`) are assumed to be defined.
+md_ocid = "modelDeploymentOcid"  # OCID of the model deployment
+int_endpoint = 'https://modeldeployment-int.us-ashburn-1.oci.oc-test.com'
+endpoint = int_endpoint
+
+
+def predict(model_id_or_inference_key):
+    predict_url = f'{endpoint}/{md_ocid}/predict'
+    predict_data = json.dumps({
+        "model": model_id_or_inference_key,  # model OCID or inference key
+        "prompt": "[user] Write a SQL query to answer the question based on the table schema.\\n\\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\\n\\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
+        "max_tokens": 100,
+        "temperature": 0
+    })
+    predict_headers = {
+        'Content-Type': 'application/json',
+        'opc-request-id': 'test-id'
+    }
+
+    response = requests.request("POST", predict_url, headers=predict_headers, data=predict_data, auth=auth, verify=False)
+    util.print_response(response)
+
+
+if __name__ == "__main__":
+    baseModel = f'{base_model_ocid}'
+    print("BaseModel", baseModel)
+    predict(baseModel)
+
+    lora1 = f'{lora_weight_1}'
+    print("lora-1", lora1)
+    predict(lora1)
+
+    lora2 = f'{lora_weight_2}'
+    print("lora-2", lora2)
+    predict(lora2)
+
+    lora3 = f'{lora_weight_3}'
+    print("lora-3", lora3)
+    predict(lora3)
+```
+
+---
+
+## References
+
+- [vLLM Docs](https://docs.vllm.ai/en/stable/features/lora.html)
+- [OCI Docs](https://docs.oracle.com/en-us/iaas/Content/data-science/using/model_dep_create.htm)
\ No newline at end of file