24 commits
- 5ef5ac4 feat(inference): add multi api server to llama2-70b (mrzzy, Nov 7, 2025)
- 6cd9d00 fix(llama2-70b): param name mismatch api_server <> api_servers (mrzzy, Nov 7, 2025)
- 33df1c5 fix(llama2-70b): typo api_server param should have no 's' (mrzzy, Nov 7, 2025)
- 064c650 revert(llama2-70b): restore original implementation of IssueQuery (mrzzy, Nov 7, 2025)
- f2a7672 feat(llama2-70b): log exception when issuing queries for debugging (mrzzy, Nov 7, 2025)
- 7198e13 fix(llama2-70b): batch_size param not effective (mrzzy, Nov 7, 2025)
- a588809 fix(llama2-70b): increase timeouts to give more time to complete request (mrzzy, Nov 7, 2025)
- 107ba62 revert(llama2-70b): "fix(llama2-70b): batch_size param not effective" (mrzzy, Nov 7, 2025)
- e76e66d fix(llama2-70b): event loop closed while issuing requests (mrzzy, Nov 7, 2025)
- d70c520 revert: "fix(llama2-70b): event loop closed while issuing requests" (mrzzy, Nov 7, 2025)
- 7869630 fix(llama2-70b): event loop closed while issuing requests (mrzzy, Nov 7, 2025)
- ed1b21c feat(llama2-70b): more generous 1 hr timeout for requests (mrzzy, Nov 7, 2025)
- 2d2a145 test(llama2-70b): fix typo in param (mrzzy, Nov 7, 2025)
- e5d4b06 perf(llama2-70b): send entire batch of prompts to api server instead … (mrzzy, Nov 7, 2025)
- 737a1c6 refactor: pass httpx as param instead of instance param (mrzzy, Nov 7, 2025)
- 38a7356 perf(llama2-70b): copy performance settings from nvidia implementation (mrzzy, Nov 7, 2025)
- ef020d2 perf(llama2-70b): shuffle to evenly distribute prompt load over servers (mrzzy, Nov 10, 2025)
- 670b0d3 fix(llama2-70b): missing arguments passing in SUTServer (mrzzy, Nov 15, 2025)
- 9fd61c1 build(llama2-70b): add pip modules needed to run updated SUT_API (mrzzy, Nov 15, 2025)
- a145d08 build(llama2-70b): add dev dependencies required to run unit tests (mrzzy, Nov 15, 2025)
- 9ad16a2 docs(llama2-70b): document how to use multinode SUT_API (mrzzy, Nov 15, 2025)
- 1ccc3d0 revert(llama2-70b): "perf(llama2-70b): shuffle to evenly distribute p… (mrzzy, Nov 15, 2025)
- 7c39ac5 Merge commit '8d1c7dfb890839b52568c8f3483887f023231243' (mrzzy, Nov 15, 2025)
- 22d6b5b fix(llama2-70b): accelerate version in requirements.txt too old (mrzzy, Nov 17, 2025)
9 changes: 9 additions & 0 deletions language/llama2-70b/CONTRIBUTING.md
@@ -0,0 +1,9 @@
# Contributing

## Unit Tests

To run the unit tests for the LLaMA 2 70B implementation, first install the development dependencies:

```bash
pip install -r requirements-dev.txt
```
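
With the dependencies installed, the tests can be run from this directory. A minimal sketch, assuming the unit tests are pytest-compatible and that pytest is provided by `requirements-dev.txt`:

```bash
# Run the unit tests in the current directory (assumes pytest comes from requirements-dev.txt)
pytest -v
```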
36 changes: 34 additions & 2 deletions language/llama2-70b/README.md
@@ -25,7 +25,7 @@ conda activate llama2-70b
# Install packages
conda install pybind11==2.10.4 -c conda-forge -y
python -m pip install torch==2.2.0.dev20231006+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==0.21.0
pip install transformers==4.31.0 nltk==3.8.1 evaluate==0.4.0 absl-py==1.4.0 rouge-score==0.1.2 sentencepiece==0.1.99 accelerate==1.11.0 httpx==0.28.1 more_itertools==10.8.0

export CUR_DIR=${PWD}
cd <inference-repo-root>/loadgen
@@ -187,6 +187,23 @@ python3 -u main.py --scenario Offline \
--device cuda:0 2>&1 | tee offline_performance_log.log
```

For models hosted behind an OpenAI-compatible LLM API endpoint (e.g. served via vLLM or TensorRT-LLM):

```
python3 -u main.py --scenario Offline \
--vllm \
--api-model-name ${MODEL_NAME} \
--api-server ${API_BASE} \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 24576 \
--dataset-path ${DATASET_PATH} \
--output-log-dir offline-logs
```

- `${API_BASE}` is the base URL of the OpenAI-compatible endpoint, e.g. `http://server1:8000/`
- **Multinode:** multiple LLM API endpoints can be provided by specifying `--api-server` multiple times (see the sketch below).
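
A minimal sketch of a multinode Offline run; the second endpoint URL (`http://server2:8000/`) is illustrative and stands in for any additional server hosting the same model:

```sh
# Each --api-server flag registers one OpenAI-compatible endpoint with the SUT.
python3 -u main.py --scenario Offline \
        --vllm \
        --api-model-name ${MODEL_NAME} \
        --api-server http://server1:8000/ \
        --api-server http://server2:8000/ \
        --model-path ${CHECKPOINT_PATH} \
        --user-conf user.conf \
        --total-sample-count 24576 \
        --dataset-path ${DATASET_PATH} \
        --output-log-dir offline-logs
```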

### Server
```
python -u main.py --scenario Server \
@@ -199,7 +216,7 @@ python -u main.py --scenario Server \
--output-log-dir server-logs
```

The ServerSUT was not tested for GPU runs.
The ServerSUT was not tested for GPU or LLM API runs.


## Run Accuracy Benchmarks
@@ -210,6 +227,7 @@ OUTPUT_LOG_DIR=offline-accuracy-logs

mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.

# for normal runs:
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
@@ -241,6 +259,20 @@ python consolidate_results.py --dataset-path ${DATASET_PATH} --model-dir ${CHECK
For the GPU run - The above steps have been automated in `run_accuracy.sh`. You can also modify this script to use
`--device cpu` to adapt it to a CPU-only run.

For models hosted behind an OpenAI-compatible LLM API endpoint,
replace the `python -u main.py` command from the normal-run instructions above with:
```sh
python3 -u main.py --scenario Offline \
--vllm \
--api-model-name ${MODEL_NAME} \
--api-server ${API_BASE} \
--model-path ${CHECKPOINT_PATH} \
--user-conf user.conf \
--total-sample-count 24576 \
--dataset-path ${DATASET_PATH} \
--output-log-dir ${OUTPUT_LOG_DIR} \
--accuracy
```

### Server
```