aws-neuron
diff --git a/.gitignore b/.gitignore
Lines changed: 142 additions & 0 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 25 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 128 additions & 8 deletions
diff --git a/SECURITY.md b/SECURITY.md
Lines changed: 11 additions & 0 deletions
diff --git a/build.sh b/build.sh
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,142 @@
+# Python .gitignore template
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# NxD
+
+build
+.vscode/
+*.iml
+.attach_pid*
+src/neuronx_distributed.egg-info/
+*.whl
+**/.DS_Store
+__pycache__
@@ -0,0 +1,25 @@
+default_language_version:
+  # force all unspecified python hooks to run python3
+  python: python3
+repos:
+- repo: https://github.com/pre-commit/pre-commit-hooks
+  rev: v2.3.0
+  hooks:
+    - id: end-of-file-fixer
+    - id: trailing-whitespace
+    - id: detect-aws-credentials
+- repo: https://github.com/pocc/pre-commit-hooks
+  rev: v1.1.1
+  hooks:
+    - id: clang-format
+      args: [--style=file, -i]
+- repo: https://github.com/astral-sh/ruff-pre-commit
+  rev: v0.5.0
+  hooks:
+    - id: ruff
+      name: ruff
+      entry: ruff
+      args: [check, --fix, "--line-length=120", "--ignore=F401,E203"]
+      types: [python]
+      language: system
+      exclude: cases_update
@@ -1,17 +1,137 @@
 ## My Project
 
-TODO: Fill this README out!
+This package provides a model hub for running inference with NeuronX Distributed (NxD).
 
-Be sure to:
+## Examples
+This package includes examples that you can reference when you implement code that uses NxD Inference.
+* `generation_demo.py` - A basic generation example for Llama.
 
-* Change the title in this README
-* Edit your repository description on GitHub
+## Run inference with the inference demo
+This package includes an inference demo console script that you can use to run inference. The script includes benchmarking and accuracy-checking features that help developers verify that their models and modules work correctly.
 
-## Security
+After you install this package, you can run the inference demo with `inference_demo`. See the examples below for how to run the inference demo. You can also run `inference_demo --help` to view all available arguments.
 
-See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
+### Example 1. Llama inference with token-matching accuracy check
+```
+inference_demo \
+  --model-type llama \
+  --task-type causal-lm \
+  run \
+    --model-path /home/ubuntu/model_hf/Llama-3.1-8B-Instruct/ \
+    --compiled-model-path /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/ \
+    --torch-dtype bfloat16 \
+    --tp-degree 32 \
+    --batch-size 2 \
+    --max-context-length 32 \
+    --seq-len 64 \
+    --on-device-sampling \
+    --enable-bucketing \
+    --top-k 1 \
+    --do-sample \
+    --pad-token-id 2 \
+    --prompt "I believe the meaning of life is" \
+    --prompt "The color of the sky is" \
+    --check-accuracy-mode token-matching \
+    --benchmark
+```
 
-## License
+### Example 2. DBRX inference with logit-matching accuracy check
 
-This project is licensed under the Apache-2.0 License.
+```
+inference_demo \
+  --model-type dbrx \
+  --task-type causal-lm \
+  run \
+    --model-path /home/ubuntu/model_hf/dbrx-1layer/ \
+    --compiled-model-path /home/ubuntu/traced_model/dbrx-1layer-demo/ \
+    --torch-dtype bfloat16 \
+    --tp-degree 32 \
+    --batch-size 2 \
+    --max-context-length 1024 \
+    --seq-len 1152 \
+    --enable-bucketing \
+    --top-k 1 \
+    --do-sample \
+    --pad-token-id 0 \
+    --prompt "I believe the meaning of life is" \
+    --prompt "The color of the sky is" \
+    --check-accuracy-mode logit-matching
+```
 
+### Example 3. Llama with speculative decoding
+
+```
+inference_demo \
+  --model-type llama \
+  --task-type causal-lm \
+  run \
+    --model-path /home/ubuntu/model_hf/open_llama_7b/ \
+    --compiled-model-path /home/ubuntu/traced_model/open_llama_7b/ \
+    --draft-model-path /home/ubuntu/model_hf/open_llama_3b/ \
+    --compiled-draft-model-path /home/ubuntu/traced_model/open_llama_3b/ \
+    --torch-dtype bfloat16 \
+    --tp-degree 32 \
+    --batch-size 1 \
+    --max-context-length 32 \
+    --seq-len 64 \
+    --enable-bucketing \
+    --speculation-length 5 \
+    --no-trace-tokengen-model \
+    --top-k 1 \
+    --do-sample \
+    --pad-token-id 2 \
+    --prompt "I believe the meaning of life is" \
+    --check-accuracy-mode token-matching \
+    --benchmark
+```
+
+### Example 4. Llama with quantization
+
+```
+inference_demo \
+  --model-type llama \
+  --task-type causal-lm \
+  run \
+    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
+    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
+    --torch-dtype bfloat16 \
+    --tp-degree 32 \
+    --batch-size 2 \
+    --max-context-length 32 \
+    --seq-len 64 \
+    --on-device-sampling \
+    --enable-bucketing \
+    --quantized \
+    --quantized-checkpoints-path /home/ubuntu/model_hf/Llama-2-7b/model_quant.pt \
+    --quantization-type per_channel_symmetric \
+    --top-k 1 \
+    --do-sample \
+    --pad-token-id 2 \
+    --prompt "I believe the meaning of life is" \
+    --prompt "The color of the sky is"
+```
+
+### Example 5. Llama inference with logit-matching accuracy check using custom error tolerances
+
+```
+inference_demo \
+  --model-type llama \
+  --task-type causal-lm \
+  run \
+    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
+    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
+    --torch-dtype bfloat16 \
+    --tp-degree 32 \
+    --batch-size 2 \
+    --max-context-length 32 \
+    --seq-len 64 \
+    --check-accuracy-mode logit-matching \
+    --divergence-difference-tol 0.005 \
+    --tol-map "{5: (1e-5, 0.02)}" \
+    --enable-bucketing \
+    --top-k 1 \
+    --do-sample \
+    --pad-token-id 2 \
+    --prompt "I believe the meaning of life is" \
+    --prompt "The color of the sky is"
+```
@@ -0,0 +1,11 @@
+## Reporting Security Issues
+
+We take all security reports seriously.
+When we receive such reports,
+we will investigate and subsequently address
+any potential vulnerabilities as quickly as possible.
+If you discover a potential security issue in this project,
+please notify AWS/Amazon Security via our
+[vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/)
+or directly via email to [AWS Security](mailto:[email protected]).
+Please do *not* create a public GitHub issue in this project.
@@ -0,0 +1,24 @@
+#! /bin/bash
+set -e
+
+: "${BUILD_PATH:=build}"
+
+python3.8 -m pip install ruff
+# Lint with ruff; under `set -e`, any reported error aborts the build
+python3.8 -m ruff check --line-length=120 --ignore=F401,E203
+# exit when asked to run `ruff` only
+if [[ "$1" == "ruff" ]]
+then
+  exit 0
+fi
+
+# Run static type checking with mypy (non-blocking: `|| true` ignores failures)
+python3.8 -m pip install mypy
+python3.8 -m mypy --no-incremental || true
+# exit when asked to run `mypy` only
+if [[ "$1" == "mypy" ]]
+then
+  exit 0
+fi
+
+python3.8 setup.py bdist_wheel --dist-dir "${BUILD_PATH}/pip/public/neuronx-distributed-inference"
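The `: ${BUILD_PATH:=build}` line in `build.sh` relies on the shell's default-assignment expansion: the no-op `:` builtin exists only to force the expansion, and `:=` sets `BUILD_PATH` to `build` when it is unset or empty while keeping any caller-provided value. The minimal sketch below isolates that idiom in a hypothetical `resolve_build_path` function (it is not part of `build.sh`) and quotes the expansion so the value is not subject to word splitting or globbing.

```shell
#!/bin/bash
# Hypothetical helper (not in build.sh) that isolates the default-assignment idiom.
resolve_build_path() {
  # `:` does nothing by itself; it only forces the expansion below to run.
  # `:=` assigns "build" if BUILD_PATH is unset or empty, else keeps its value.
  : "${BUILD_PATH:=build}"
  # Same wheel output directory layout that build.sh passes to --dist-dir.
  echo "${BUILD_PATH}/pip/public/neuronx-distributed-inference"
}

unset BUILD_PATH
resolve_build_path   # prints build/pip/public/neuronx-distributed-inference

BUILD_PATH=/tmp/out
resolve_build_path   # prints /tmp/out/pip/public/neuronx-distributed-inference
```

Following the script's own branches, `./build.sh ruff` stops after the ruff check, `./build.sh mypy` runs ruff and then mypy before stopping, and running it with no argument also builds the wheel. For example, `BUILD_PATH=/tmp/out ./build.sh` would place the wheel under `/tmp/out/pip/public/neuronx-distributed-inference`.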