Add CI workflows to provision azure VM and run mshv unit tests #213

Closed
wants to merge 2 commits into from
229 changes: 229 additions & 0 deletions .github/workflows/mshv-infra.yaml
@@ -0,0 +1,229 @@
name: MSHV Infra Setup
Contributor

Just one meta question: all these runs are cancellable, right? So if I update the PR, the previous run gets cancelled and the resource group gets deleted along with everything in it?

Collaborator Author

@gamora12 Jun 12, 2025

Yes, these runs can be cancelled. The cleanup job is declared with `if: always()`, so it runs at the end even if the workflow is cancelled by a newer PR update. The workflow also sets `cancel-in-progress: true` in its concurrency group, so when a new commit is pushed, GitHub automatically cancels the previous run.

on:
  workflow_call:
    inputs:
      ARCH:
        description: 'Architecture for the VM'
        required: true
        type: string
      KEY:
        description: 'SSH Key Name'
        required: true
        type: string
      OS_DISK_SIZE:
        description: 'OS Disk Size in GB'
        required: true
        type: string
      RG:
        description: 'Resource Group Name'
        required: true
        type: string
      VM_SKU:
        description: 'VM SKU'
        required: true
        type: string
    secrets:
      MI_CLIENT_ID:
        required: true
      RUNNER_RG:
        required: true
      STORAGE_ACCOUNT_PATHS:
        required: true
      ARCH_SOURCE_PATH:
        required: true
      USERNAME:
        required: true
    outputs:
      PRIVATE_IP:
        description: 'Private IP of the VM'
        value: ${{ jobs.infra-setup.outputs.PRIVATE_IP }}
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  infra-setup:
    name: ${{ inputs.ARCH }} VM Provision
    runs-on:
      - self-hosted
      - Linux
    outputs:
      PRIVATE_IP: ${{ steps.get-vm-ip.outputs.PRIVATE_IP }}
    steps:
      - name: Install & login to AZ CLI
        env:
          MI_CLIENT_ID: ${{ secrets.MI_CLIENT_ID }}
        run: |
          set -e
          echo "Installing Azure CLI if not already installed"
          if ! command -v az &>/dev/null; then
            curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
          else
            echo "Azure CLI already installed"
          fi
          az --version
          echo "Logging into Azure CLI using Managed Identity"
          az login --identity --client-id "${MI_CLIENT_ID}"

      - name: Get Location
        id: get-location
        env:
          SKU: ${{ inputs.VM_SKU }}
          STORAGE_ACCOUNT_PATHS: ${{ secrets.STORAGE_ACCOUNT_PATHS }}
        run: |
          set -e
          # Extract vCPU count from SKU (e.g., "Standard_D2s_v3" => 2)
          vcpu=$(echo "$SKU" | sed -n 's/^Standard_[A-Za-z]\+\([0-9]\+\).*/\1/p')
          if [[ -z "$vcpu" ]]; then
            echo "Cannot extract vCPU count from SKU: $SKU"
            exit 1
          fi

          # Candidate locations are the keys of the STORAGE_ACCOUNT_PATHS JSON object
          SUPPORTED_LOCATIONS=$(echo "$STORAGE_ACCOUNT_PATHS" | jq -r 'to_entries[] | .key')

          for location in $SUPPORTED_LOCATIONS; do
            family=$(az vm list-skus --size "$SKU" --location "$location" --resource-type "virtualMachines" --query '[0].family' -o tsv)
            if [[ -z "$family" ]]; then
              echo "Cannot determine VM family for SKU: $SKU in $location"
              continue
            fi

            usage=$(az vm list-usage --location "$location" --query "[?name.value=='$family'] | [0]" -o json)
            current=$(echo "$usage" | jq -r '.currentValue')
            limit=$(echo "$usage" | jq -r '.limit')

            # Skip locations without numeric usage data, so the arithmetic
            # below never sees an empty or "null" value.
            if ! [[ "$current" =~ ^[0-9]+$ && "$limit" =~ ^[0-9]+$ ]]; then
              echo "No usage data for family $family in $location"
              continue
            fi

            if [[ $((limit - current)) -ge $vcpu ]]; then
              echo "Sufficient quota found in $location"
              echo "location=$location" >> "$GITHUB_OUTPUT"
              exit 0
            fi
          done

          echo "No location found with sufficient vCPU quota for SKU: $SKU"
          exit 1

      - name: Create Resource Group
        id: rg-setup
        env:
          LOCATION: ${{ steps.get-location.outputs.location }}
          RG: ${{ inputs.RG }}
          STORAGE_ACCOUNT_PATHS: ${{ secrets.STORAGE_ACCOUNT_PATHS }}
        run: |
          set -e
          echo "Creating Resource Group: $RG"
          # Create the resource group
          echo "Creating resource group in location: ${LOCATION}"
          az group create --name "${RG}" --location "${LOCATION}"
          echo "Resource group created successfully."

      - name: Generate SSH Key
        id: generate-ssh-key
        env:
          KEY: ${{ inputs.KEY }}
        run: |
          set -e
          echo "Generating SSH key: $KEY"
          mkdir -p ~/.ssh
          ssh-keygen -t rsa -b 4096 -f ~/.ssh/"${KEY}" -N ""

      - name: Create VM
        id: vm-setup
        env:
          KEY: ${{ inputs.KEY }}
          LOCATION: ${{ steps.get-location.outputs.location }}
          OS_DISK_SIZE: ${{ inputs.OS_DISK_SIZE }}
          RG: ${{ inputs.RG }}
          RUNNER_RG: ${{ secrets.RUNNER_RG }}
          USERNAME: ${{ secrets.USERNAME }}
          VM_SKU: ${{ inputs.VM_SKU }}
          VM_IMAGE_NAME: ${{ inputs.ARCH }}_${{ steps.get-location.outputs.location }}_image
          VM_NAME: ${{ inputs.ARCH }}_${{ steps.get-location.outputs.location }}_${{ github.run_id }}
        run: |
          set -e
          echo "Creating $VM_SKU VM: $VM_NAME"

          # Look up the subnet ID from the vnet in the runner resource group.
          # "// empty" makes jq print nothing instead of "null" when no subnet
          # matches, so the -z check below actually catches the failure.
          echo "Retrieving subnet ID..."
          SUBNET_ID=$(az network vnet list --resource-group "${RUNNER_RG}" --query "[?contains(location, '${LOCATION}')].{SUBNETS:subnets}" | jq -r '.[0].SUBNETS[0].id // empty')
          if [[ -z "${SUBNET_ID}" ]]; then
            echo "ERROR: Failed to retrieve Subnet ID."
            exit 1
          fi

          # Look up the image ID from the runner resource group
          echo "Retrieving image ID..."
          IMAGE_ID=$(az image show --resource-group "${RUNNER_RG}" --name "${VM_IMAGE_NAME}" --query "id" -o tsv)
          if [[ -z "${IMAGE_ID}" ]]; then
            echo "ERROR: Failed to retrieve Image ID."
            exit 1
          fi

          # Create VM
          az vm create \
            --resource-group "${RG}" \
            --name "${VM_NAME}" \
            --subnet "${SUBNET_ID}" \
            --size "${VM_SKU}" \
            --location "${LOCATION}" \
            --image "${IMAGE_ID}" \
            --os-disk-size-gb "${OS_DISK_SIZE}" \
            --public-ip-sku Standard \
            --storage-sku Premium_LRS \
            --public-ip-address "" \
            --admin-username "${USERNAME}" \
            --ssh-key-value ~/.ssh/"${KEY}".pub \
            --security-type Standard \
            --output json

          echo "VM creation process completed successfully."

      - name: Get VM Private IP
        id: get-vm-ip
        env:
          RG: ${{ inputs.RG }}
          VM_NAME: ${{ inputs.ARCH }}_${{ steps.get-location.outputs.location }}_${{ github.run_id }}
        run: |
          set -e
          echo "Retrieving VM Private IP address..."
          # Retrieve VM Private IP address
          PRIVATE_IP=$(az vm show -g "${RG}" -n "${VM_NAME}" -d --query privateIps -o tsv)
          if [[ -z "$PRIVATE_IP" ]]; then
            echo "ERROR: Failed to retrieve private IP address."
            exit 1
          fi
          echo "PRIVATE_IP=$PRIVATE_IP" >> "$GITHUB_OUTPUT"

      - name: Wait for SSH availability
        env:
          KEY: ${{ inputs.KEY }}
          PRIVATE_IP: ${{ steps.get-vm-ip.outputs.PRIVATE_IP }}
          USERNAME: ${{ secrets.USERNAME }}
        run: |
          echo "Waiting for SSH to be accessible..."
          # Retry every 5 seconds; give up after 120 seconds in total
          timeout 120 bash -c 'until ssh -o StrictHostKeyChecking=no -i ~/.ssh/${KEY} ${USERNAME}@${PRIVATE_IP} "exit" 2>/dev/null; do sleep 5; done'
          echo "VM is accessible!"

      - name: Remove Old Host Key
        env:
          PRIVATE_IP: ${{ steps.get-vm-ip.outputs.PRIVATE_IP }}
        run: |
          set -e
          echo "Removing the old host key"
          ssh-keygen -R "$PRIVATE_IP"

      - name: SSH into VM and Install Dependencies
        env:
          KEY: ${{ inputs.KEY }}
          PRIVATE_IP: ${{ steps.get-vm-ip.outputs.PRIVATE_IP }}
          USERNAME: ${{ secrets.USERNAME }}
        run: |
          set -e
          # The heredoc is unquoted, so \$ defers expansion to the remote shell
          ssh -i ~/.ssh/"${KEY}" -o StrictHostKeyChecking=no "${USERNAME}@${PRIVATE_IP}" << EOF
            set -e
            echo "Logged in successfully."
            echo "Installing dependencies..."
            sudo tdnf install -y git moby-engine moby-cli clang llvm pkg-config make gcc glibc-devel
            echo "Installing Rust..."
            curl -sSf https://sh.rustup.rs | sh -s -- --default-toolchain stable --profile default -y
            export PATH="\$HOME/.cargo/bin:\$PATH"
            cargo --version
          EOF
96 changes: 96 additions & 0 deletions .github/workflows/mshv-tests.yaml
@@ -0,0 +1,96 @@
name: Build & Test MSHV Crate
on:
  pull_request:
  workflow_dispatch:
    inputs:
      branch:
        description: 'Branch to build and test'
        required: true
        default: 'main'
jobs:
  infra-setup:
    name: MSHV Infra Setup (x86_64)
    uses: ./.github/workflows/mshv-infra.yaml
    with:
      ARCH: x86_64
      KEY: azure_key_${{ github.run_id }}
      OS_DISK_SIZE: 512
      RG: RUST-VMM-MSHV-${{ github.run_id }}
      VM_SKU: Standard_D16s_v5
    secrets:
      MI_CLIENT_ID: ${{ secrets.MSHV_MI_CLIENT_ID }}
      RUNNER_RG: ${{ secrets.MSHV_RUNNER_RG }}
      STORAGE_ACCOUNT_PATHS: ${{ secrets.MSHV_STORAGE_ACCOUNT_PATHS }}
      ARCH_SOURCE_PATH: ${{ secrets.MSHV_X86_SOURCE_PATH }}
      USERNAME: ${{ secrets.MSHV_USERNAME }}

  build-test:
    name: Build & test
    needs: infra-setup
    if: ${{ always() && needs.infra-setup.result == 'success' }}
    runs-on:
      - self-hosted
      - Linux
    steps:
      - name: Determine branch to build
        # Pass untrusted values (such as the PR branch name) through env
        # instead of expanding them inside the script, to avoid script injection.
        env:
          EVENT_NAME: ${{ github.event_name }}
          PR_HEAD_REF: ${{ github.event.pull_request.head.ref }}
          INPUT_BRANCH: ${{ inputs.branch }}
        run: |
          echo "Determining branch to build and test..."
          if [[ "$EVENT_NAME" == "pull_request" ]]; then
            echo "BRANCH=$PR_HEAD_REF" >> "$GITHUB_ENV"
          else
            echo "BRANCH=$INPUT_BRANCH" >> "$GITHUB_ENV"
          fi

      - name: Build & Run tests on remote VM
        env:
          BRANCH_NAME: ${{ env.BRANCH }}
          KEY: azure_key_${{ github.run_id }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          PRIVATE_IP: ${{ needs.infra-setup.outputs.PRIVATE_IP }}
          RG: RUST-VMM-MSHV-${{ github.run_id }}
          USERNAME: ${{ secrets.MSHV_USERNAME }}
        run: |
          set -e
          echo "Connecting to the VM via SSH..."
          # The heredoc is unquoted: BRANCH_NAME expands on the runner,
          # while \$HOME and \$PATH expand on the remote VM.
          ssh -i ~/.ssh/"${KEY}" -o StrictHostKeyChecking=no "${USERNAME}@${PRIVATE_IP}" << EOF
            set -e
            echo "Logged in successfully."
            export PATH="\$HOME/.cargo/bin:\$PATH"
            echo "${BRANCH_NAME}"
            git clone --depth 1 --single-branch --branch "${BRANCH_NAME}" https://github.com/rust-vmm/mshv.git
            cd mshv
            cargo build --all-features --workspace
            cargo test --all-features --workspace
          EOF
          echo "Build and test completed successfully."

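  cleanup:
    name: Cleanup
    needs: build-test
    if: always()
    runs-on:
      - self-hosted
      - Linux
    steps:
      - name: Delete RG
        env:
          RG: RUST-VMM-MSHV-${{ github.run_id }}
        run: |
          # `az group exists` prints "true" or "false" and exits 0 either way,
          # so compare its output instead of testing its exit status.
          if [[ "$(az group exists --name "${RG}")" == "true" ]]; then
            az group delete --name "${RG}" --yes --no-wait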
Contributor

This can fail, right? Do we have another service that periodically goes through all the stale resource groups and cleans them up?

Collaborator Author

@gamora12 Jun 12, 2025

The cleanup job deletes the resource group at the end of each run. It always runs, even if the previous jobs fail or get cancelled.

As of now, we don't have a separate service to ensure the cleanup.

Contributor

I mean that cleanup executes `az group delete`, but this operation can fail for several reasons, for example an auth issue. In that case the resource group will not be deleted by the pipeline, so we would need an external cleanup agent that checks for leftover resources.

Collaborator Author

@gamora12 Jun 12, 2025

Agreed, we can add a separate scheduled (cron) cleanup workflow for this, in a separate PR.
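
A minimal sketch of the selection logic such a scheduled cleanup could use, written as plain shell. It is not part of this PR and rests on assumptions: the `select_stale_groups` helper, the `created` tag, and the six-hour threshold are all hypothetical, and the workflow in this PR does not currently tag resource groups at creation time.

```shell
#!/bin/sh
# Hypothetical helper for a scheduled cleanup job; not part of this PR.
# Reads "<resource-group-name> <created-epoch>" pairs on stdin and prints the
# names of groups that start with the given prefix and are older than max_age
# seconds. Assumption: each group carries a "created" tag holding its creation
# time in epoch seconds, e.g. set with:
#   az group create --name "$RG" --location "$LOCATION" --tags created="$(date +%s)"
select_stale_groups() {
  prefix="$1"
  max_age="$2"
  now="$3"
  while read -r name created; do
    # Skip groups that do not belong to this pipeline.
    case "$name" in
      "$prefix"*) ;;
      *) continue ;;
    esac
    # Skip groups without a numeric tag rather than guessing their age.
    case "$created" in
      '' | *[!0-9]*) continue ;;
    esac
    if [ $((now - created)) -gt "$max_age" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Possible use in a cron-triggered workflow step (21600 s = 6 h, an assumed limit):
#   az group list --query "[].[name, tags.created]" -o tsv \
#     | select_stale_groups "RUST-VMM-MSHV-" 21600 "$(date +%s)" \
#     | xargs -r -n1 az group delete --yes --no-wait --name
```

Keeping the age check separate from the `az` calls means the filter can be exercised without touching Azure, and groups that lack the tag are left alone instead of being deleted by mistake.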

          else
            echo "Resource Group ${RG} does not exist. Skipping deletion."
          fi
          echo "Cleanup process completed."

      - name: Delete SSH Key
        env:
          KEY: azure_key_${{ github.run_id }}
        run: |
          if [ -f ~/.ssh/"${KEY}" ]; then
            rm -f ~/.ssh/"${KEY}" ~/.ssh/"${KEY}".pub
            echo "SSH key deleted successfully."
          else
            echo "SSH key does not exist. Skipping deletion."
          fi
          echo "Cleanup process completed."