8 changes: 8 additions & 0 deletions nemoguardrails/benchmark/Procfile
Collaborator

I see hardcoded ports in multiple places

The ports (8000, 8001, 9000) are hardcoded in:

  • Procfile
  • validate_mocks.py
  • config.yml

If someone changes one location, they must remember to change all three. Consider:

  • environment variables
  • a shared config file (a config.py that reads from env vars with defaults; see the sketch below)
  • or, at minimum, documenting this dependency
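
For illustration, a tiny shared module could act as the single source of truth (module and variable names here are hypothetical, just to show the idea):

```python
# benchmark_ports.py (hypothetical): single source of truth for the benchmark ports.
# Each port can be overridden via an environment variable and falls back to a default.
import os

APP_LLM_PORT = int(os.environ.get("APP_LLM_PORT", "8000"))        # main / application mock LLM
CS_LLM_PORT = int(os.environ.get("CS_LLM_PORT", "8001"))          # content-safety mock LLM
GUARDRAILS_PORT = int(os.environ.get("GUARDRAILS_PORT", "9000"))  # Guardrails server
```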

Collaborator Author

Yes, this is a limitation at the moment. This is intended for local offline testing, to quantify performance regressions in roughly the same time as unit tests take today. I could have a central config which propagates out to the Guardrails configs, Mock LLM configs, and Procfile to tie all three together and keep them consistent. But I'd prefer to keep it simple and add more documentation to help guide people in configuring everything. Will get a README written up for this.

Collaborator

The Procfile has relative paths that assume we're running from the nemoguardrails/benchmark directory.

If we run honcho from the project root, these paths won't work. We need either:

  • absolute paths from the project root
  • clear documentation that honcho start must be run from nemoguardrails/benchmark/

Collaborator Author

This was intentional; the Procfile is only intended for use in the benchmarking directory. If it were at the project root, I'd expect it to spin up a production set of Guardrails services rather than a set of mocked LLMs and Guardrails. I'll add a proper README to explain all this.

@@ -0,0 +1,8 @@
# Procfile

# NeMo Guardrails server
gr: poetry run nemoguardrails server --config configs/guardrail_configs --default-config-id content_safety_colang1 --port 9000

# Guardrails NIMs for inference
app_llm: poetry run python mock_llm_server/run_server.py --workers 4 --port 8000 --config-file configs/mock_configs/meta-llama-3.3-70b-instruct.env
cs_llm: poetry run python mock_llm_server/run_server.py --workers 4 --port 8001 --config-file configs/mock_configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env
159 changes: 159 additions & 0 deletions nemoguardrails/benchmark/README.md
@@ -0,0 +1,159 @@
# Guardrails Benchmarking

NeMo Guardrails includes benchmarking tools to help users capacity-test their Guardrails applications.
Adding guardrails to an LLM-based application improves safety and security while adding some latency. These benchmarks let users quantify that tradeoff and make data-driven decisions.
We currently provide a simple testbench that runs the Guardrails server with mocks standing in for the guardrail and application models. It can be used for performance testing on a laptop without any GPUs and completes in a few minutes.

## Guardrails Core Benchmarking

This benchmark measures the performance of the Guardrails application running on a CPU-only laptop or instance.
It doesn't require GPUs to run local models, or internet access to use provider-hosted models.
All models use the [Mock LLM Server](mock_llm_server), a simplified stand-in for an LLM inference server.
The aim of this benchmark is to detect performance regressions as quickly as running unit tests.

## Quickstart: Running Guardrails with Mock LLMs
To run Guardrails with mocks for both the content-safety and main LLM, follow the steps below. All commands must be run in the `nemoguardrails/benchmark` directory. These assume you already have a working environment after following the [contribution guidelines](../CONTRIBUTING.md).

First, we need to install the `honcho` and `langchain-nvidia-ai-endpoints` packages.
The `honcho` package is used to run Procfile-based applications, and is a Python port of [Foreman](https://github.com/ddollar/foreman).
The `langchain-nvidia-ai-endpoints` package is used to communicate with the Mock LLMs via LangChain.

```shell
# Install dependencies
$ poetry run pip install honcho langchain-nvidia-ai-endpoints
...
Successfully installed filetype-1.2.0 honcho-2.0.0 langchain-nvidia-ai-endpoints-0.3.19
```

Now we can start up the processes that are part of the [Procfile](Procfile).
As the Procfile processes spin up, they log to the console with a prefix. The `system` prefix is used by Honcho, `app_llm` is the Application or Main LLM mock, `cs_llm` is the content-safety mock, and `gr` is the Guardrails service. We'll explore the Procfile in more detail below.
Once all three 'Uvicorn running on ...' messages have been printed, you can move to the next step. Note that these messages will likely not appear on consecutive lines.

```
# All commands must be run in the nemoguardrails/benchmark directory
$ cd nemoguardrails/benchmark
$ poetry run honcho start
13:40:33 system | gr.1 started (pid=93634)
13:40:33 system | app_llm.1 started (pid=93635)
13:40:33 system | cs_llm.1 started (pid=93636)
...
13:40:41 app_llm.1 | INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
...
13:40:41 cs_llm.1 | INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
...
13:40:45 gr.1 | INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
```

Once Guardrails and the mock servers are up, we can use the `validate_mocks.py` script to check they're healthy and serving the correct models.

```shell
$ cd nemoguardrails/benchmark
$ poetry run python validate_mocks.py
Starting LLM endpoint health check...

--- Checking Port: 8000 ---
Checking http://localhost:8000/health ...
HTTP Request: GET http://localhost:8000/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8000/v1/models for 'meta/llama-3.3-70b-instruct'...
HTTP Request: GET http://localhost:8000/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'meta/llama-3.3-70b-instruct' in model list.
--- Port 8000: ALL CHECKS PASSED ---

--- Checking Port: 8001 ---
Checking http://localhost:8001/health ...
HTTP Request: GET http://localhost:8001/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8001/v1/models for 'nvidia/llama-3.1-nemoguard-8b-content-safety'...
HTTP Request: GET http://localhost:8001/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'nvidia/llama-3.1-nemoguard-8b-content-safety' in model list.
--- Port 8001: ALL CHECKS PASSED ---

--- Checking Port: 9000 (Rails Config) ---
Checking http://localhost:9000/v1/rails/configs ...
HTTP Request: GET http://localhost:9000/v1/rails/configs "HTTP/1.1 200 OK"
HTTP Status PASSED: Got 200.
Body Check PASSED: Response is an array with at least one entry.
--- Port 9000: ALL CHECKS PASSED ---

--- Final Summary ---
Port 8000 (meta/llama-3.3-70b-instruct): PASSED
Port 8001 (nvidia/llama-3.1-nemoguard-8b-content-safety): PASSED
Port 9000 (Rails Config): PASSED
---------------------
Overall Status: All endpoints are healthy!
```

Once the mocks and Guardrails are running and the validation script passes, we can issue curl requests against the Guardrails `/v1/chat/completions` endpoint to generate a response and test the system end-to-end.

```shell
curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.3-70b-instruct",
"messages": [
{
"role": "user",
"content": "what can you do for me?"
}
],
"stream": false
}' | jq
{
"messages": [
{
"role": "assistant",
"content": "I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."
}
]
}

```

## Deep-Dive: Configuration

In this section, we'll examine the configuration files used in the quickstart above. This gives more context on how the system works and how it can be extended as needed.

### Procfile

The [Procfile](Procfile?raw=true) contains all the processes that make up the application.
The Honcho package reads in this file, starts all the processes, and combines their log output in the console.
The `gr` line runs the Guardrails server on port 9000 and sets the default Guardrails configuration as [content_safety_colang1](configs/guardrail_configs/content_safety_colang1?raw=true).
The `app_llm` line runs the Application or Main Mock LLM. Guardrails calls this LLM to generate a response to the user's query. This server uses 4 uvicorn workers and runs on port 8000. The configuration file here is a Mock LLM configuration, not a Guardrails configuration.
The `cs_llm` line runs the Content-Safety Mock LLM. This uses 4 uvicorn workers and runs on port 8001.

### Guardrails Configuration
The [Guardrails Configuration](configs/guardrail_configs/content_safety_colang1/config.yml) is used by the Guardrails server.
Under the `models` section, the `main` model is used to generate responses to the user's queries. Its base URL points to the `app_llm` Mock LLM from the Procfile, running on port 8000, and the `model` field must match the model name served by that mock.
The `content_safety` model is configured for use in an input and an output rail. Its `type` field matches the `$model` used in the input and output flows.
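
For orientation, a `models` section wired up to the two mocks looks roughly like the sketch below (illustrative only; refer to the linked config.yml for the exact engine and field values used in this benchmark):

```yaml
models:
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: http://localhost:8000/v1   # app_llm mock from the Procfile
  - type: content_safety
    engine: nvidia_ai_endpoints
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    parameters:
      base_url: http://localhost:8001/v1   # cs_llm mock from the Procfile
```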

### Mock LLM Endpoints
The Mock LLM implements a subset of the OpenAI LLM API.
There are two Mock LLM configurations, one for the Mock [main model](configs/mock_configs/meta-llama-3.3-70b-instruct.env), and another for the Mock [content-safety](configs/mock_configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env) model.
The Mock LLM serves the following endpoints (example requests follow the list):

* `/health`: Returns a JSON object with `status` set to `healthy` and a `timestamp` in seconds since the epoch. For example `{"status":"healthy","timestamp":1762781239}`
* `/v1/models`: Returns the `MODEL` field from the Mock configuration (see below). For example `{"object":"list","data":[{"id":"meta/llama-3.3-70b-instruct","object":"model","created":1762781290,"owned_by":"system"}]}`
* `/v1/completions`: Returns an [OpenAI completion object](https://platform.openai.com/docs/api-reference/completions/object) using the Mock configuration (see below).
* `/v1/chat/completions`: Returns an [OpenAI chat completion object](https://platform.openai.com/docs/api-reference/chat/object) using the Mock configuration (see below).
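
With the mocks running as in the quickstart, these endpoints can be exercised directly, for example (the timestamp will differ on your machine):

```shell
# Health check against the main Mock LLM (port 8000 in the Procfile)
$ curl -s http://localhost:8000/health
{"status":"healthy","timestamp":1762781239}

# Confirm which model this mock is serving
$ curl -s http://localhost:8000/v1/models | jq '.data[0].id'
"meta/llama-3.3-70b-instruct"
```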

### Mock LLM Configuration
Mock LLMs are configured using the `.env` file format. These files are passed to the Mock LLM using the `--config-file` argument.
The Mock LLMs return either the `SAFE_TEXT` or the `UNSAFE_TEXT` string in response to `/v1/completions` or `/v1/chat/completions` inference requests.
The probability of the `UNSAFE_TEXT` being returned is given by `UNSAFE_PROBABILITY`.
The latency of each response is also configurable, and works as follows (see the sketch after the list):

* Latency is first sampled from a normal distribution with mean `LATENCY_MEAN_SECONDS` and standard deviation `LATENCY_STD_SECONDS`.
* If the sampled value is less than `LATENCY_MIN_SECONDS`, it is set to `LATENCY_MIN_SECONDS`.
* If the sampled value is greater than `LATENCY_MAX_SECONDS`, it is set to `LATENCY_MAX_SECONDS`.
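
In other words, the latency is a clamped normal sample. A minimal sketch of the equivalent logic (not the actual server code; the function name is illustrative):

```python
import random


def sample_latency(mean: float, std: float, min_s: float, max_s: float) -> float:
    """Draw a latency from N(mean, std) and clamp it to [min_s, max_s]."""
    latency = random.gauss(mean, std)
    return min(max(latency, min_s), max_s)


# Example: mean 0.5s, std 0.2s, clamped to [0.1s, 1.0s]
print(sample_latency(0.5, 0.2, 0.1, 1.0))
```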

The full list of configuration fields is shown below:
* `MODEL`: The model name served by the Mock LLM. This is returned by the `/v1/models` endpoint.
* `UNSAFE_PROBABILITY`: Probability of returning the unsafe response; must be in the range [0, 1].
* `UNSAFE_TEXT`: String returned as an unsafe response.
* `SAFE_TEXT`: String returned as a safe response.
* `LATENCY_MIN_SECONDS`: Minimum latency in seconds.
* `LATENCY_MAX_SECONDS`: Maximum latency in seconds.
* `LATENCY_MEAN_SECONDS`: Mean of the normal distribution used to sample latency.
* `LATENCY_STD_SECONDS`: Standard deviation of the normal distribution used to sample latency.
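
Putting it together, a Mock LLM `.env` file looks like the illustrative example below (the values are made up; see `configs/mock_configs/` for the files actually referenced by the Procfile):

```
# Example Mock LLM configuration (illustrative values only)
MODEL=meta/llama-3.3-70b-instruct
UNSAFE_PROBABILITY=0.1
SAFE_TEXT="I can help you with a wide range of topics."
UNSAFE_TEXT="I can't assist with that request."
LATENCY_MIN_SECONDS=0.1
LATENCY_MAX_SECONDS=1.0
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.2
```
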
10 changes: 8 additions & 2 deletions nemoguardrails/benchmark/mock_llm_server/run_server.py
@@ -71,7 +71,12 @@ def parse_arguments():
parser.add_argument(
"--config-file", help=".env file to configure model", required=True
)

parser.add_argument(
"--workers",
type=int,
default=1,
help="Number of uvicorn worker processes (default: 1)",
)
return parser.parse_args()


@@ -104,12 +109,13 @@ def main(): # pragma: no cover

try:
uvicorn.run(
"api:app",
"nemoguardrails.benchmark.mock_llm_server.api:app",
host=args.host,
port=args.port,
reload=args.reload,
log_level=args.log_level,
env_file=config_file,
workers=args.workers,
)
except KeyboardInterrupt:
log.info("\nServer stopped by user")