diff --git a/IaC/cdk/README.md b/IaC/cdk/README.md
new file mode 100644
index 0000000000..bc96c15ea0
--- /dev/null
+++ b/IaC/cdk/README.md
@@ -0,0 +1,64 @@
+# CDK Projects
+
+AWS CDK (Cloud Development Kit) infrastructure-as-code implementations.
+
+## Projects
+
+### aws-resources-cleanup
+Comprehensive AWS resource cleanup Lambda with CDK deployment.
+
+**Purpose**: Automated cleanup of EC2 instances, EKS clusters, and OpenShift infrastructure based on TTL policies and billing tags.
+
+**Features**:
+- TTL-based expiration (e.g. 8h or 24h, driven by the `delete-cluster-after-hours` tag)
+- Billing tag validation (category + Unix timestamps)
+- EKS CloudFormation deletion
+- OpenShift comprehensive cleanup (VPC, ELB, Route53, S3, NAT, security groups)
+- DRY_RUN mode (default)
+- SNS notifications
+- EventBridge schedule (default: every 15 minutes, see `ScheduleRateMinutes`)
+
+**Quick Start**:
+```bash
+cd aws-resources-cleanup
+just install   # Install dependencies
+just deploy    # Deploy in DRY_RUN mode
+just logs      # Tail CloudWatch logs
+```
+
+📖 **Full documentation**: [aws-resources-cleanup/README.md](aws-resources-cleanup/README.md)
+
+## Requirements
+
+- AWS CLI configured with an appropriate profile
+- `uv` package manager: `brew install uv`
+- `just` task runner: `brew install just`
+
+## Common Commands
+
+All projects use a Justfile for consistent automation:
+
+| Command | Description |
+|---------|-------------|
+| `just install` | Install all dependencies |
+| `just synth` | Generate CloudFormation template |
+| `just diff` | Preview infrastructure changes |
+| `just deploy` | Deploy stack |
+| `just destroy` | Remove stack |
+| `just logs` | Tail CloudWatch logs (if applicable) |
+
+## Adding New CDK Projects
+
+When creating a new CDK project in this directory:
+
+1. Create the project directory: `mkdir project-name`
+2. Initialize CDK: `cdk init app --language python`
+3. Add a Justfile for automation
+4. Add a project-specific README.md
+5. Update this README with the project description
+
+## Resources
+
+- [AWS CDK Documentation](https://docs.aws.amazon.com/cdk/)
+- [CDK Python API Reference](https://docs.aws.amazon.com/cdk/api/v2/python/)
+- [Justfile Documentation](https://github.com/casey/just)
diff --git a/IaC/cdk/aws-resources-cleanup/.gitignore b/IaC/cdk/aws-resources-cleanup/.gitignore
new file mode 100644
index 0000000000..91a127762f
--- /dev/null
+++ b/IaC/cdk/aws-resources-cleanup/.gitignore
@@ -0,0 +1,57 @@
+# CDK
+cdk.out/
+.cdk.staging/
+cdk.context.json
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# Virtual environments
+venv/
+ENV/
+env/
+.venv
+
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+.tox/
+
+# IDEs
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Lambda artifacts
+*.zip
+/tmp/
+
+# Logs
+*.log
diff --git a/IaC/cdk/aws-resources-cleanup/README.md b/IaC/cdk/aws-resources-cleanup/README.md
new file mode 100644
index 0000000000..e5cad36b5a
--- /dev/null
+++ b/IaC/cdk/aws-resources-cleanup/README.md
@@ -0,0 +1,95 @@
+# AWS Resources Cleanup
+
+Automated Lambda for EC2, EBS, EKS, and OpenShift cleanup across AWS regions.
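+
+For example, an instance opts in to TTL cleanup through the tags evaluated by
+the cleanup policies described below (the instance ID here is a placeholder):
+
+```bash
+aws ec2 create-tags --resources i-0123456789abcdef0 \
+  --tags Key=creation-time,Value=$(date +%s) \
+         Key=delete-cluster-after-hours,Value=8
+```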
+
+**Runtime**: Python 3.13 ARM64, 1024MB, 600s timeout
+**Default**: DRY_RUN mode (logs only)
+**Concurrency**: 1 (prevents race conditions)
+
+## Features
+
+- **EC2**: TTL expiration, stop policy, long-stopped instances, untagged cleanup
+- **EBS**: Unattached volume deletion
+- **EKS**: CloudFormation stack deletion (skip pattern: `pe-.*`)
+- **OpenShift**: Full cluster cleanup (VPC, ELB, Route53, S3)
+
+**Protection**: Persistent tags (`jenkins-*`, `pmm-dev`), valid billing tags, `PerconaKeep`, "do not remove" in names
+
+## Quick Start
+
+```bash
+brew install uv just
+cd IaC/cdk/aws-resources-cleanup
+just install
+just bootstrap       # First time only
+just deploy          # DRY_RUN mode
+just deploy-live     # LIVE mode (destructive!)
+```
+
+## Commands
+
+```bash
+just deploy          # Deploy (DRY_RUN)
+just logs            # Tail logs
+just invoke-aws      # Manual trigger
+just params          # Show config
+just test            # Run tests (176 tests, 87% coverage)
+```
+
+Run `just` for all commands.
+
+## Configuration
+
+Key parameters (CloudFormation):
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `DryRunMode` | `true` | Safe mode |
+| `ScheduleRateMinutes` | `15` | Run frequency |
+| `TargetRegions` | `all` | Regions to scan |
+| `LogLevel` | `INFO` | Log verbosity |
+| `UntaggedThresholdMinutes` | `30` | Grace period for untagged instances |
+| `VolumeCleanupEnabled` | `true` | Enable volume cleanup |
+| `EKSCleanupEnabled` | `true` | Enable EKS cleanup |
+| `OpenShiftCleanupEnabled` | `true` | Enable OpenShift cleanup |
+
+View all: `just params`
+
+## Cleanup Policies
+
+Checked in priority order:
+
+1. **TTL** - `creation-time` + `delete-cluster-after-hours` → TERMINATE
+2. **Stop** - `stop-after-days` → STOP
+3. **Long Stopped** - stopped >30 days → TERMINATE
+4. **Untagged** - missing `iit-billing-tag` → TERMINATE
+
+## Logging
+
+```
+Instance i-0d09... protected: Valid billing tag 'ps-package-testing'
+[DRY-RUN] Would TERMINATE instance i-085e... in us-east-2: Missing billing tag
+Instance scan for us-west-2: 11 scanned, 1 actions, 10 protected
+Cleanup complete: 31 actions across 17 regions (15.4s)
+```
+
+## Troubleshooting
+
+```bash
+just logs-recent     # Check logs
+just params          # Verify config
+just invoke-aws      # Test manually
+```
+
+**Issues:**
+- Actions are logged but nothing is deleted: the stack deploys with `DryRunMode=true` by default; redeploy with `DryRunMode=false` to execute
+- Volume cleanup fails: confirm `VolumeCleanupEnabled=true` and that the volumes are in the `available` state
+- OpenShift errors: cleanup automatically retries up to 3 times
+
+## Architecture
+
+```
+EventBridge → Lambda → EC2/Volumes/EKS/OpenShift → SNS
+```
+
+The justfile resolves the Lambda function name from the CDK stack outputs, so its commands always target the function the stack deployed.
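The `app.py` below imports `ResourceCleanupStack` from `stacks/resource_cleanup_stack.py`, which is not included in this diff. As a rough sketch only — the construct IDs, handler path, asset path, and the hard-coded 15-minute rate are assumptions — the stack presumably wires together the pieces the README describes:

```python
from aws_cdk import (
    CfnOutput,
    Duration,
    Stack,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_sns as sns,
)
from constructs import Construct


class ResourceCleanupStack(Stack):
    """Hypothetical sketch; the real stack lives in stacks/resource_cleanup_stack.py."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # SNS topic for cleanup reports (see send_notification in handler.py).
        topic = sns.Topic(self, "CleanupNotifications")

        # Runtime/memory/timeout/concurrency figures come from the README above.
        # IAM permissions for EC2/EKS/Route53/S3 cleanup are omitted in this sketch.
        fn = _lambda.Function(
            self,
            "CleanupFunction",
            runtime=_lambda.Runtime.PYTHON_3_13,
            architecture=_lambda.Architecture.ARM_64,
            memory_size=1024,
            timeout=Duration.seconds(600),
            reserved_concurrent_executions=1,  # single concurrent run
            code=_lambda.Code.from_asset("lambda"),
            handler="aws_resource_cleanup.handler.lambda_handler",  # assumed path
            environment={
                "DRY_RUN": "true",  # the real stack maps the DryRunMode parameter
                "SNS_TOPIC_ARN": topic.topic_arn,
            },
        )
        topic.grant_publish(fn)

        # Fixed rate here; the real stack presumably derives this from the
        # ScheduleRateMinutes CloudFormation parameter (default 15).
        events.Rule(
            self,
            "CleanupSchedule",
            schedule=events.Schedule.rate(Duration.minutes(15)),
            targets=[targets.LambdaFunction(fn)],
        )

        # The justfile resolves the function name from this stack output key.
        CfnOutput(self, "LambdaFunctionName", value=fn.function_name)
```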
diff --git a/IaC/cdk/aws-resources-cleanup/app.py b/IaC/cdk/aws-resources-cleanup/app.py new file mode 100644 index 0000000000..b01a9b3e59 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/app.py @@ -0,0 +1,25 @@ +#!/usr/bin/env python3 +"""CDK app for AWS Resources Cleanup Lambda.""" + +import os +import aws_cdk as cdk +from stacks.resource_cleanup_stack import ResourceCleanupStack + +app = cdk.App() + +ResourceCleanupStack( + app, + "AWSResourcesCleanupStack", + description="Comprehensive AWS resource cleanup: EC2, EKS, OpenShift infrastructure", + env=cdk.Environment( + account=os.getenv('CDK_DEFAULT_ACCOUNT'), + region=os.getenv('CDK_DEFAULT_REGION', 'us-east-2') + ), + tags={ + "Project": "PlatformEngineering", + "ManagedBy": "CDK", + "iit-billing-tag": "resource-cleanup" + } +) + +app.synth() diff --git a/IaC/cdk/aws-resources-cleanup/cdk.json b/IaC/cdk/aws-resources-cleanup/cdk.json new file mode 100644 index 0000000000..92a7592a42 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/cdk.json @@ -0,0 +1,114 @@ +{ + "app": "python3 app.py", + "watch": { + "include": [ + "**" + ], + "exclude": [ + "README.md", + "cdk*.json", + "requirements*.txt", + "source.bat", + "**/__pycache__", + "**/*.pyc", + ".pytest_cache" + ] + }, + "context": { + "@aws-cdk/aws-lambda:recognizeLayerVersion": true, + "@aws-cdk/core:checkSecretUsage": true, + "@aws-cdk/core:target-partitions": [ + "aws", + "aws-cn" + ], + "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, + "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, + "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, + "@aws-cdk/aws-iam:minimizePolicies": true, + "@aws-cdk/core:validateSnapshotRemovalPolicy": true, + "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, + "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, + "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, + "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, + "@aws-cdk/core:enablePartitionLiterals": true, + "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, + "@aws-cdk/aws-iam:standardizedServicePrincipals": true, + "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true, + "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true, + "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true, + "@aws-cdk/aws-route53-patternslibrary:useCertificate": true, + "@aws-cdk/customresources:installLatestAwsSdkDefault": false, + "@aws-cdk/aws-rds:databaseProxyUniqueResourceName": true, + "@aws-cdk/aws-codedeploy:removeAlarmsFromDeploymentGroup": true, + "@aws-cdk/aws-apigateway:authorizerChangeDeploymentLogicalId": true, + "@aws-cdk/aws-ec2:launchTemplateDefaultUserData": true, + "@aws-cdk/aws-secretsmanager:useAttachedSecretResourcePolicyForSecretTargetAttachments": true, + "@aws-cdk/aws-redshift:columnId": true, + "@aws-cdk/aws-stepfunctions-tasks:enableEmrServicePolicyV2": true, + "@aws-cdk/aws-ec2:restrictDefaultSecurityGroup": true, + "@aws-cdk/aws-apigateway:requestValidatorUniqueId": true, + "@aws-cdk/aws-kms:aliasNameRef": true, + "@aws-cdk/aws-autoscaling:generateLaunchTemplateInsteadOfLaunchConfig": true, + "@aws-cdk/core:includePrefixInUniqueNameGeneration": true, + "@aws-cdk/aws-efs:denyAnonymousAccess": true, + "@aws-cdk/aws-opensearchservice:enableOpensearchMultiAzWithStandby": true, + "@aws-cdk/aws-lambda-nodejs:useLatestRuntimeVersion": true, + "@aws-cdk/aws-efs:mountTargetOrderInsensitiveLogicalId": true, + 
"@aws-cdk/aws-rds:auroraClusterChangeScopeOfInstanceParameterGroupWithEachParameters": true, + "@aws-cdk/aws-appsync:useArnForSourceApiAssociationIdentifier": true, + "@aws-cdk/aws-rds:preventRenderingDeprecatedCredentials": true, + "@aws-cdk/aws-codepipeline-actions:useNewDefaultBranchForCodeCommitSource": true, + "@aws-cdk/aws-cloudwatch-actions:changeLambdaPermissionLogicalIdForLambdaAction": true, + "@aws-cdk/aws-codepipeline:crossAccountKeysDefaultValueToFalse": true, + "@aws-cdk/aws-codepipeline:defaultPipelineTypeToV2": true, + "@aws-cdk/aws-kms:reduceCrossAccountRegionPolicyScope": true, + "@aws-cdk/aws-eks:nodegroupNameAttribute": true, + "@aws-cdk/aws-ec2:ebsDefaultGp3Volume": true, + "@aws-cdk/aws-ecs:removeDefaultDeploymentAlarm": true, + "@aws-cdk/custom-resources:logApiResponseDataPropertyTrueDefault": false, + "@aws-cdk/aws-s3:keepNotificationInImportedBucket": false, + "@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true, + "@aws-cdk/aws-appsync:appSyncGraphQLAPIScopeLambdaPermission": true, + "@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true, + "@aws-cdk/aws-dynamodb:resourcePolicyPerReplica": true, + "@aws-cdk/aws-dynamodb:retainTableReplica": true, + "@aws-cdk/aws-ec2-alpha:useResourceIdForVpcV2Migration": false, + "@aws-cdk/aws-ec2:bastionHostUseAmazonLinux2023ByDefault": true, + "@aws-cdk/aws-ec2:ec2SumTImeoutEnabled": true, + "@aws-cdk/aws-ec2:requirePrivateSubnetsForEgressOnlyInternetGateway": true, + "@aws-cdk/aws-ecs-patterns:secGroupsDisablesImplicitOpenListener": true, + "@aws-cdk/aws-ecs:disableEcsImdsBlocking": true, + "@aws-cdk/aws-ecs:enableImdsBlockingDeprecatedFeature": false, + "@aws-cdk/aws-ecs:reduceEc2FargateCloudWatchPermissions": true, + "@aws-cdk/aws-elasticloadbalancingV2:albDualstackWithoutPublicIpv4SecurityGroupRulesDefault": true, + "@aws-cdk/aws-events:requireEventBusPolicySid": true, + "@aws-cdk/aws-iam:oidcRejectUnauthorizedConnections": true, + "@aws-cdk/aws-kms:applyImportedAliasPermissionsToPrincipal": true, + "@aws-cdk/aws-lambda-nodejs:sdkV3ExcludeSmithyPackages": true, + "@aws-cdk/aws-lambda:createNewPoliciesWithAddToRolePolicy": false, + "@aws-cdk/aws-lambda:recognizeVersionProps": true, + "@aws-cdk/aws-lambda:useCdkManagedLogGroup": true, + "@aws-cdk/aws-rds:lowercaseDbIdentifier": true, + "@aws-cdk/aws-rds:setCorrectValueForDatabaseInstanceReadReplicaInstanceResourceId": true, + "@aws-cdk/aws-route53-patters:useCertificate": true, + "@aws-cdk/aws-route53-targets:userPoolDomainNameMethodWithoutCustomResource": true, + "@aws-cdk/aws-s3:publicAccessBlockedByDefault": true, + "@aws-cdk/aws-s3:setUniqueReplicationRoleName": true, + "@aws-cdk/aws-signer:signingProfileNamePassedToCfn": true, + "@aws-cdk/aws-stepfunctions-tasks:fixRunEcsTaskPolicy": true, + "@aws-cdk/aws-stepfunctions-tasks:useNewS3UriParametersForBedrockInvokeModelTask": true, + "@aws-cdk/aws-stepfunctions:useDistributedMapResultWriterV2": true, + "@aws-cdk/cognito:logUserPoolClientSecretValue": false, + "@aws-cdk/core:aspectPrioritiesMutating": true, + "@aws-cdk/core:aspectStabilization": true, + "@aws-cdk/core:cfnIncludeRejectComplexResourceUpdateCreatePolicyIntrinsics": true, + "@aws-cdk/core:enableAdditionalMetadataCollection": true, + "@aws-cdk/core:explicitStackTags": true, + "@aws-cdk/core:newStyleStackSynthesis": true, + "@aws-cdk/core:stackRelativeExports": true, + "@aws-cdk/pipelines:reduceAssetRoleTrustScope": true, + "@aws-cdk/pipelines:reduceCrossAccountActionRoleTrustScope": true, + "@aws-cdk/pipelines:reduceStageRoleTrustScope": true, 
+ "@aws-cdk/s3-notifications:addS3TrustKeyPolicyForSnsSubscriptions": true + } +} diff --git a/IaC/cdk/aws-resources-cleanup/justfile b/IaC/cdk/aws-resources-cleanup/justfile new file mode 100644 index 0000000000..f82f74f9c1 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/justfile @@ -0,0 +1,275 @@ +# AWS Resources Cleanup - CDK Deployment Automation +# Usage: just + +# Default AWS profile and region +profile := env_var_or_default("AWS_PROFILE", "percona-dev-admin") +region := env_var_or_default("AWS_REGION", "us-east-2") + +# Lambda function name from CDK stack outputs (dynamically retrieved) +# This ensures justfile always uses the function name defined in the CDK stack +# Falls back to "LambdaAWSResourceCleanup" if stack doesn't exist yet +lambda_function := `aws cloudformation describe-stacks --stack-name AWSResourcesCleanupStack --profile percona-dev-admin --region us-east-2 --query "Stacks[0].Outputs[?OutputKey=='LambdaFunctionName'].OutputValue | [0]" --output text 2>/dev/null || echo "LambdaAWSResourceCleanup"` + +# Default recipe - show available commands +default: + @echo "AWS Resources Cleanup - CDK Deployment" + @echo "" + @echo "Quick Start:" + @echo " just install Install dependencies" + @echo " just deploy Deploy in DRY_RUN mode (safe)" + @echo " just logs Tail CloudWatch logs" + @echo " just invoke-aws Test Lambda execution" + @echo "" + @echo "Deployment:" + @echo " just deploy Deploy in DRY_RUN mode (default)" + @echo " just deploy-live Deploy in LIVE mode (destructive!)" + @echo " just diff Preview infrastructure changes" + @echo " just synth Generate CloudFormation template" + @echo " just destroy Destroy the entire stack" + @echo "" + @echo "Monitoring:" + @echo " just logs Tail CloudWatch logs (follow)" + @echo " just logs-recent Show logs from last hour" + @echo " just invoke-aws Manually invoke Lambda" + @echo " just info Show Lambda configuration" + @echo " just outputs Show stack outputs" + @echo " just params Show stack parameters" + @echo "" + @echo "Testing & Quality:" + @echo " just test Run unit tests" + @echo " just test-coverage Run tests with detailed coverage" + @echo " just lint Run linters" + @echo " just format Format code" + @echo " just ci Full CI pipeline (lint + test + synth)" + @echo "" + @echo "Maintenance:" + @echo " just update-code Fast Lambda code update (no CDK)" + @echo " just upgrade Upgrade all dependencies" + @echo " just versions Show installed versions" + @echo " just clean Clean build artifacts" + @echo " just validate Validate CloudFormation template" + @echo "" + @echo "Run 'just --list' for all commands" + +# Install dependencies +install: + @echo "Installing CDK and Lambda dependencies..." + uv pip install -r requirements.txt + @echo "Installing Lambda dependencies..." + cd lambda && uv pip install -r aws_resource_cleanup/requirements.txt + +# Bootstrap CDK (first time only) +bootstrap: + @echo "Bootstrapping CDK in {{region}} with profile {{profile}}..." + uv run cdk bootstrap aws://$(aws sts get-caller-identity --profile {{profile}} --query Account --output text)/{{region}} \ + --profile {{profile}} \ + --region {{region}} + +# Synthesize CloudFormation template +synth: + @echo "Synthesizing CloudFormation template..." + uv run cdk synth --profile {{profile}} --region {{region}} + +# Deploy in DRY_RUN mode (default, safe) +deploy: + @echo "Deploying in DRY_RUN mode (safe)..." 
+ uv run cdk deploy \ + --profile {{profile}} \ + --region {{region}} \ + --require-approval never + +# Deploy in LIVE mode (destructive, requires confirmation) +deploy-live: + @echo "⚠️ WARNING: Deploying in LIVE mode (will delete resources!)" + @echo "Press Ctrl+C to cancel, or Enter to continue..." + @read _ + uv run cdk deploy \ + --profile {{profile}} \ + --region {{region}} \ + --parameters DryRunMode=false \ + --require-approval never + +# Deploy with custom parameters +deploy-custom DRY_RUN="true" THRESHOLD="30" EMAIL="": + @echo "Deploying with custom parameters..." + uv run cdk deploy \ + --profile {{profile}} \ + --region {{region}} \ + --parameters DryRunMode={{DRY_RUN}} \ + --parameters UntaggedThresholdMinutes={{THRESHOLD}} \ + --parameters NotificationEmail={{EMAIL}} + +# Destroy the stack (cleanup) +destroy: + @echo "⚠️ WARNING: This will destroy the entire stack!" + @echo "Press Ctrl+C to cancel, or Enter to continue..." + @read _ + uv run cdk destroy --profile {{profile}} --region {{region}} --force + +# Diff against deployed stack +diff: + @echo "Comparing local changes with deployed stack..." + uv run cdk diff --profile {{profile}} --region {{region}} + +# Tail CloudWatch logs +logs: + @echo "Tailing CloudWatch logs for Lambda..." + aws logs tail /aws/lambda/{{lambda_function}} \ + --follow \ + --format short \ + --profile {{profile}} \ + --region {{region}} + +# Tail recent logs (last hour) +logs-recent: + @echo "Showing logs from last hour..." + aws logs tail /aws/lambda/{{lambda_function}} \ + --since 1h \ + --format short \ + --profile {{profile}} \ + --region {{region}} + +# Invoke Lambda manually (AWS) +invoke-aws: + @echo "Invoking Lambda in AWS..." + aws lambda invoke \ + --function-name {{lambda_function}} \ + --profile {{profile}} \ + --region {{region}} \ + --log-type Tail \ + /tmp/lambda-response.json + @echo "\nResponse:" + @cat /tmp/lambda-response.json | jq '.' + @rm /tmp/lambda-response.json + +# Get Lambda function info +info: + @echo "Lambda function information:" + aws lambda get-function \ + --function-name {{lambda_function}} \ + --profile {{profile}} \ + --region {{region}} \ + --query 'Configuration.{Name:FunctionName,Runtime:Runtime,Memory:MemorySize,Timeout:Timeout,Modified:LastModified,Architecture:Architectures[0]}' \ + --output table + +# Run unit tests +test: + @echo "Running unit tests..." + PYTHONPATH=lambda:$$PYTHONPATH uv run --python 3.13 --with pytest --with pytest-cov --with 'aws-lambda-powertools[tracer]' --with boto3 --with botocore pytest tests/ -v --cov=aws_resource_cleanup + +# Run unit tests with detailed coverage report +test-coverage: + @echo "Running unit tests with coverage report..." + PYTHONPATH=lambda:$$PYTHONPATH uv run --python 3.13 --with pytest --with pytest-cov --with 'aws-lambda-powertools[tracer]' --with boto3 --with botocore pytest tests/ -v --cov=aws_resource_cleanup --cov-report=term-missing + +# Run linting +lint: + @echo "Running linters..." + uv run --with ruff ruff check lambda/aws_resource_cleanup/ + uv run --with black black --check lambda/aws_resource_cleanup/ + uv run --with mypy mypy lambda/aws_resource_cleanup/ + +# Format code +format: + @echo "Formatting code..." + uv run --with black black lambda/aws_resource_cleanup/ + uv run --with ruff ruff check --fix lambda/aws_resource_cleanup/ + +# Clean build artifacts +clean: + @echo "Cleaning build artifacts..." 
+    find . -type f -name '*.pyc' -delete
+    find . -type d -name '__pycache__' -prune -exec rm -rf {} +
+    rm -rf cdk.out
+    rm -rf .pytest_cache
+    rm -rf tests/.pytest_cache
+    rm -rf lambda/aws_resource_cleanup.egg-info
+    rm -f /tmp/lambda-response.json
+
+# Full CI pipeline (lint, test, synth)
+ci: lint test synth
+    @echo "✓ CI pipeline completed successfully"
+
+# Watch for changes and auto-deploy
+watch:
+    @echo "Watching for changes (auto-deploy on save)..."
+    uv run cdk watch --profile {{profile}} --region {{region}}
+
+# Show stack outputs
+outputs:
+    @echo "Stack outputs:"
+    aws cloudformation describe-stacks \
+        --stack-name AWSResourcesCleanupStack \
+        --profile {{profile}} \
+        --region {{region}} \
+        --query 'Stacks[0].Outputs' \
+        --output table
+
+# Show stack parameters
+params:
+    @echo "Stack parameters:"
+    aws cloudformation describe-stacks \
+        --stack-name AWSResourcesCleanupStack \
+        --profile {{profile}} \
+        --region {{region}} \
+        --query 'Stacks[0].Parameters' \
+        --output table
+
+# List all Lambda functions
+list-lambdas:
+    @echo "All Lambda functions in {{region}}:"
+    aws lambda list-functions \
+        --profile {{profile}} \
+        --region {{region}} \
+        --query 'Functions[?starts_with(FunctionName, `Lambda`)].{Name:FunctionName,Runtime:Runtime,Size:CodeSize,Modified:LastModified}' \
+        --output table
+
+# Update Lambda code only (faster than full deploy)
+update-code:
+    @echo "Building Lambda package..."
+    cd lambda && zip -r /tmp/lambda-code.zip aws_resource_cleanup/
+    @echo "Updating Lambda function code..."
+    aws lambda update-function-code \
+        --function-name {{lambda_function}} \
+        --zip-file fileb:///tmp/lambda-code.zip \
+        --profile {{profile}} \
+        --region {{region}}
+    @rm /tmp/lambda-code.zip
+    @echo "✓ Lambda code updated"
+
+# Update Lambda environment variables
+update-env DRY_RUN="true":
+    @echo "Updating Lambda environment variables..."
+    aws lambda update-function-configuration \
+        --function-name {{lambda_function}} \
+        --environment "Variables={DRY_RUN={{DRY_RUN}}}" \
+        --profile {{profile}} \
+        --region {{region}}
+    @echo "✓ Environment updated to DRY_RUN={{DRY_RUN}}"
+
+# Validate CloudFormation template
+validate:
+    @echo "Validating CloudFormation template..."
+    uv run cdk synth --profile {{profile}} --region {{region}} > /tmp/template.yaml
+    aws cloudformation validate-template \
+        --template-body file:///tmp/template.yaml \
+        --profile {{profile}} \
+        --region {{region}}
+    @rm /tmp/template.yaml
+    @echo "✓ Template is valid"
+
+# Upgrade all dependencies
+upgrade:
+    @echo "Upgrading CDK and Python dependencies..."
+    uv pip install --upgrade aws-cdk-lib constructs boto3
+    @echo "Upgrading dev tools..."
+ uv pip install --upgrade pytest pytest-cov moto ruff black mypy + @echo "✓ All dependencies upgraded" + @echo "Run 'just synth' to verify CDK works" + +# Show versions +versions: + @echo "CDK version:" + @uv run cdk --version + @echo "\nPython packages:" + @uv pip list | grep -E "(aws-cdk-lib|constructs|boto3|pytest|ruff|black|mypy)" diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/__init__.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/__init__.py new file mode 100644 index 0000000000..bfbe23f5ff --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/__init__.py @@ -0,0 +1,5 @@ +"""EC2 Cleanup Lambda - Modular implementation.""" + +from .handler import lambda_handler + +__all__ = ["lambda_handler"] diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/__init__.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/__init__.py new file mode 100644 index 0000000000..0fbfcb1889 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/__init__.py @@ -0,0 +1,31 @@ +"""EC2 instance management, volume cleanup, and cleanup policies.""" + +from .instances import ( + cirrus_ci_add_iit_billing_tag, + is_protected, + execute_cleanup_action, +) +from .policies import ( + check_ttl_expiration, + check_stop_after_days, + check_long_stopped, + check_untagged, +) +from .volumes import ( + check_unattached_volume, + delete_volume, + is_volume_protected, +) + +__all__ = [ + "cirrus_ci_add_iit_billing_tag", + "is_protected", + "execute_cleanup_action", + "check_ttl_expiration", + "check_stop_after_days", + "check_long_stopped", + "check_untagged", + "check_unattached_volume", + "delete_volume", + "is_volume_protected", +] diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/instances.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/instances.py new file mode 100644 index 0000000000..9abd5842b0 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/instances.py @@ -0,0 +1,285 @@ +"""EC2 instance operations.""" + +from __future__ import annotations +import boto3 +from botocore.exceptions import ClientError +from typing import Any +from ..models import CleanupAction +from ..models.config import DRY_RUN, PERSISTENT_TAGS +from ..utils import has_valid_billing_tag, get_logger +from ..eks.cloudformation import delete_eks_cluster_stack +from ..openshift.orchestrator import destroy_openshift_cluster +from ..openshift.detection import detect_openshift_infra_id + +logger = get_logger() + + +def cirrus_ci_add_iit_billing_tag( + instance: dict[str, Any], tags_dict: dict[str, str] +) -> None: + """Add iit-billing-tag to CirrusCI instances (existing functionality).""" + has_cirrus_ci_tag = tags_dict.get("CIRRUS_CI", "").lower() == "true" + has_iit_billing_tag = "iit-billing-tag" in tags_dict + + if has_cirrus_ci_tag and not has_iit_billing_tag: + try: + ec2_resource = boto3.resource( + "ec2", region_name=instance["Placement"]["AvailabilityZone"][:-1] + ) + ec2_instance = ec2_resource.Instance(instance["InstanceId"]) + ec2_instance.create_tags( + Tags=[{"Key": "iit-billing-tag", "Value": "CirrusCI"}] + ) + + instance_name = tags_dict.get("Name") + cirrus_repo = tags_dict.get("CIRRUS_REPO_FULL_NAME") + cirrus_task = tags_dict.get("CIRRUS_TASK_ID") + + logger.info( + "CirrusCI instance auto-tagged", + extra={ + "instance_id": instance["InstanceId"], + "instance_name": instance_name, + "billing_tag": "CirrusCI", + "cirrus_repo": cirrus_repo, 
+ "cirrus_task": cirrus_task, + }, + ) + except ClientError as e: + logger.error( + f"Error tagging CirrusCI instance {instance['InstanceId']}: {e}" + ) + + +def is_protected(tags_dict: dict[str, str], instance_id: str = "") -> tuple[bool, str]: + """ + Check if instance is protected from auto-deletion. + + Returns: + Tuple of (is_protected, reason) where reason describes why it's protected + """ + billing_tag = tags_dict.get("iit-billing-tag", "") + name = tags_dict.get("Name", "") + + # Protected by persistent billing tag + if billing_tag in PERSISTENT_TAGS: + reason = f"Persistent billing tag '{billing_tag}'" + if instance_id: + logger.info( + "Instance protected", + extra={ + "instance_id": instance_id, + "instance_name": name, + "protection_reason": reason, + "billing_tag": billing_tag, + }, + ) + return True, reason + + # Protected if has valid billing tag (category or non-expired timestamp) + if has_valid_billing_tag(tags_dict): + # But only if it doesn't have TTL tags (TTL takes precedence) + has_ttl = ( + "delete-cluster-after-hours" in tags_dict or "stop-after-days" in tags_dict + ) + if not has_ttl: + reason = f"Valid billing tag '{billing_tag}'" + if instance_id: + logger.info( + "Instance protected", + extra={ + "instance_id": instance_id, + "instance_name": name, + "protection_reason": reason, + "billing_tag": billing_tag, + }, + ) + return True, reason + + return False, "" + + +def execute_cleanup_action(action: CleanupAction, region: str) -> bool: + """Execute a cleanup action (terminate, stop, etc.).""" + ec2 = boto3.client("ec2", region_name=region) + + try: + if action.action == "TERMINATE": + if DRY_RUN: + logger.info( + "Would TERMINATE instance", + extra={ + "dry_run": True, + "instance_id": action.instance_id, + "region": region, + "reason": action.reason, + }, + ) + else: + logger.info( + "TERMINATE instance", + extra={ + "instance_id": action.instance_id, + "region": region, + "reason": action.reason, + }, + ) + ec2.terminate_instances(InstanceIds=[action.instance_id]) + return True + + elif action.action == "TERMINATE_CLUSTER": + from ..models.config import EKS_CLEANUP_ENABLED + + if not action.cluster_name: + logger.error( + "Missing cluster_name for TERMINATE_CLUSTER action", + extra={"instance_id": action.instance_id, "action": action.action}, + ) + return False + + if EKS_CLEANUP_ENABLED: + if DRY_RUN: + logger.info( + "Would TERMINATE_CLUSTER eks", + extra={ + "dry_run": True, + "cluster_name": action.cluster_name, + "cluster_type": "eks", + "region": region, + }, + ) + logger.info( + "Would TERMINATE instance for cluster", + extra={ + "dry_run": True, + "instance_id": action.instance_id, + "cluster_name": action.cluster_name, + }, + ) + else: + logger.info( + "TERMINATE_CLUSTER eks", + extra={ + "cluster_name": action.cluster_name, + "cluster_type": "eks", + "region": region, + }, + ) + delete_eks_cluster_stack(action.cluster_name, region) + logger.info( + "TERMINATE instance for cluster", + extra={ + "instance_id": action.instance_id, + "cluster_name": action.cluster_name, + }, + ) + ec2.terminate_instances(InstanceIds=[action.instance_id]) + else: + logger.info( + "EKS cleanup disabled", + extra={ + "instance_id": action.instance_id, + "action": "TERMINATE_only", + }, + ) + return True + + elif action.action == "TERMINATE_OPENSHIFT_CLUSTER": + from ..models.config import OPENSHIFT_CLEANUP_ENABLED + + if not action.cluster_name: + logger.error( + "Missing cluster_name for TERMINATE_OPENSHIFT_CLUSTER action", + extra={"instance_id": action.instance_id, 
"action": action.action}, + ) + return False + + if OPENSHIFT_CLEANUP_ENABLED: + cluster_name = action.cluster_name + infra_id = detect_openshift_infra_id(cluster_name, region) + if infra_id: + if DRY_RUN: + logger.info( + "Would TERMINATE_OPENSHIFT_CLUSTER", + extra={ + "dry_run": True, + "cluster_name": cluster_name, + "infra_id": infra_id, + "cluster_type": "openshift", + "region": region, + }, + ) + else: + logger.info( + "TERMINATE_OPENSHIFT_CLUSTER", + extra={ + "cluster_name": cluster_name, + "infra_id": infra_id, + "cluster_type": "openshift", + "region": region, + }, + ) + destroy_openshift_cluster(cluster_name, infra_id, region) + if DRY_RUN: + logger.info( + "Would TERMINATE instance for cluster", + extra={ + "dry_run": True, + "instance_id": action.instance_id, + "cluster_name": cluster_name, + }, + ) + else: + logger.info( + "TERMINATE instance for cluster", + extra={ + "instance_id": action.instance_id, + "cluster_name": cluster_name, + }, + ) + ec2.terminate_instances(InstanceIds=[action.instance_id]) + else: + logger.info( + "OpenShift cleanup disabled", + extra={ + "instance_id": action.instance_id, + "action": "TERMINATE_only", + }, + ) + return True + + elif action.action == "STOP": + if DRY_RUN: + logger.info( + "Would STOP instance", + extra={ + "dry_run": True, + "instance_id": action.instance_id, + "region": region, + "reason": action.reason, + }, + ) + else: + logger.info( + "STOP instance", + extra={ + "instance_id": action.instance_id, + "region": region, + "reason": action.reason, + }, + ) + ec2.stop_instances(InstanceIds=[action.instance_id]) + return True + + except ClientError as e: + logger.error( + "Failed to execute cleanup action", + extra={ + "action": action.action, + "instance_id": action.instance_id, + "error": str(e), + }, + ) + return False + + return False diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/policies.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/policies.py new file mode 100644 index 0000000000..0fc0380109 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/policies.py @@ -0,0 +1,206 @@ +"""EC2 cleanup policy checks (TTL, stop-after-days, long-stopped, untagged).""" + +from __future__ import annotations +import datetime +from typing import Any +from ..models import CleanupAction +from ..models.config import UNTAGGED_THRESHOLD_MINUTES, STOPPED_THRESHOLD_DAYS +from ..utils import extract_cluster_name, has_valid_billing_tag, get_logger + +logger = get_logger() + + +def check_ttl_expiration( + instance: dict[str, Any], tags_dict: dict[str, str], current_time: int +) -> CleanupAction | None: + """ + Check if instance has expired based on creation-time + delete-cluster-after-hours. + Returns CleanupAction if expired, None otherwise. 
+ """ + creation_time_str = tags_dict.get("creation-time") + ttl_hours_str = tags_dict.get("delete-cluster-after-hours") + + if not creation_time_str or not ttl_hours_str: + return None + + try: + creation_time = int(creation_time_str) + ttl_hours = int(ttl_hours_str) + except ValueError: + logger.warning( + "Invalid TTL tags", + extra={ + "instance_id": instance["InstanceId"], + "creation_time": creation_time_str, + "ttl_hours": ttl_hours_str, + }, + ) + return None + + expiration_time = creation_time + (ttl_hours * 3600) + + if current_time >= expiration_time: + seconds_overdue = current_time - expiration_time + days_overdue = seconds_overdue / 86400 + + name = tags_dict.get("Name", "N/A") + billing_tag = tags_dict.get("iit-billing-tag", "unknown") + cluster_name = extract_cluster_name(tags_dict) + owner = tags_dict.get("owner", "unknown") + + created_at = datetime.datetime.fromtimestamp( + creation_time, tz=datetime.timezone.utc + ).strftime("%Y-%m-%d %H:%M:%S UTC") + expired_at = datetime.datetime.fromtimestamp( + expiration_time, tz=datetime.timezone.utc + ).strftime("%Y-%m-%d %H:%M:%S UTC") + + reason = f"TTL expired: {ttl_hours}h policy. Created {created_at}, expired {expired_at}" + + action = "TERMINATE" + if cluster_name: + is_openshift = billing_tag == "openshift" or any( + tag.startswith("openshift-") for tag in tags_dict.keys() + ) + action = ( + "TERMINATE_OPENSHIFT_CLUSTER" if is_openshift else "TERMINATE_CLUSTER" + ) + + return CleanupAction( + instance_id=instance["InstanceId"], + region="", # Set by caller + name=name, + action=action, + reason=reason, + days_overdue=days_overdue, + billing_tag=billing_tag, + cluster_name=cluster_name, + owner=owner, + ) + + return None + + +def check_stop_after_days( + instance: dict[str, Any], tags_dict: dict[str, str], current_time: int +) -> CleanupAction | None: + """ + Check if instance should be stopped based on stop-after-days policy. + Used for PMM staging instances. + """ + stop_after_days_str = tags_dict.get("stop-after-days") + + if not stop_after_days_str or instance["State"]["Name"] != "running": + return None + + try: + stop_after_days = int(stop_after_days_str) + except ValueError: + return None + + launch_time = instance.get("LaunchTime") + if not launch_time: + return None + + launch_timestamp = int(launch_time.timestamp()) + stop_at_time = launch_timestamp + (stop_after_days * 86400) + + if current_time >= stop_at_time: + seconds_overdue = current_time - stop_at_time + days_overdue = seconds_overdue / 86400 + + name = tags_dict.get("Name", "N/A") + billing_tag = tags_dict.get("iit-billing-tag", "unknown") + owner = tags_dict.get("owner", "unknown") + + launched_at = launch_time.strftime("%Y-%m-%d %H:%M:%S UTC") + + reason = f"Stop policy: {stop_after_days}d. Launched {launched_at}" + + return CleanupAction( + instance_id=instance["InstanceId"], + region="", + name=name, + action="STOP", + reason=reason, + days_overdue=days_overdue, + billing_tag=billing_tag, + owner=owner, + ) + + return None + + +def check_long_stopped( + instance: dict[str, Any], tags_dict: dict[str, str], current_time: int +) -> CleanupAction | None: + """ + Check if instance has been stopped for more than the configured threshold. + Stopped instances still incur EBS storage costs. + Uses configurable threshold from environment variable. 
+ """ + if instance["State"]["Name"] != "stopped": + return None + + launch_time = instance.get("LaunchTime") + if not launch_time: + return None + + launch_timestamp = int(launch_time.timestamp()) + days_since_launch = (current_time - launch_timestamp) / 86400 + + if days_since_launch > STOPPED_THRESHOLD_DAYS: + name = tags_dict.get("Name", "N/A") + billing_tag = tags_dict.get("iit-billing-tag", "unknown") + + days_overdue = days_since_launch - STOPPED_THRESHOLD_DAYS + reason = f"Stopped instance older than {STOPPED_THRESHOLD_DAYS} days" + + return CleanupAction( + instance_id=instance["InstanceId"], + region="", + name=name, + action="TERMINATE", + reason=reason, + days_overdue=days_overdue, + billing_tag=billing_tag, + ) + + return None + + +def check_untagged( + instance: dict[str, Any], tags_dict: dict[str, str], current_time: int +) -> CleanupAction | None: + """ + Check if instance is untagged or has invalid billing tag. + Uses configurable threshold from environment variable. + """ + # Skip if has valid billing tag (including non-expired timestamps) + if has_valid_billing_tag(tags_dict, instance.get("LaunchTime")): + return None + + launch_time = instance.get("LaunchTime") + if not launch_time: + return None + + launch_timestamp = int(launch_time.timestamp()) + minutes_running = (current_time - launch_timestamp) / 60 + + if minutes_running < UNTAGGED_THRESHOLD_MINUTES: + return None + + days_running = minutes_running / 1440 + name = tags_dict.get("Name", "N/A") + + reason = f"Missing billing tag. Running {minutes_running:.0f} minutes (threshold: {UNTAGGED_THRESHOLD_MINUTES})" + + return CleanupAction( + instance_id=instance["InstanceId"], + region="", + name=name, + action="TERMINATE", + reason=reason, + days_overdue=days_running, + billing_tag="", + ) diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/volumes.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/volumes.py new file mode 100644 index 0000000000..dbe5caacd8 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/ec2/volumes.py @@ -0,0 +1,263 @@ +"""EBS volume cleanup detection and execution.""" + +from __future__ import annotations +import boto3 +from botocore.exceptions import ClientError +from typing import Any +from ..models import CleanupAction +from ..models.config import DRY_RUN, PERSISTENT_TAGS +from ..utils import convert_tags_to_dict, has_valid_billing_tag, get_logger + +logger = get_logger() + + +def is_volume_protected( + tags_dict: dict[str, str], volume_id: str = "" +) -> tuple[bool, str]: + """ + Check if volume is protected from auto-deletion. + + Protection mechanisms (from legacy LambdaVolumeCleanup.yml): + 1. Name tag contains "do not remove" + 2. Has "PerconaKeep" tag + 3. Has persistent billing tag (jenkins-*, pmm-dev) + 4. 
Has valid billing tag (category or non-expired timestamp) + + Returns: + Tuple of (is_protected, reason) where reason describes why it's protected + """ + name = tags_dict.get("Name", "") + + # Legacy protection: "do not remove" in Name tag + if "do not remove" in name.lower(): + reason = "Name contains 'do not remove'" + if volume_id: + logger.info( + "Volume protected", + extra={ + "volume_id": volume_id, + "volume_name": name, + "protection_reason": reason, + }, + ) + return True, reason + + # Legacy protection: PerconaKeep tag + if "PerconaKeep" in tags_dict: + reason = "Has PerconaKeep tag" + if volume_id: + logger.info( + "Volume protected", + extra={ + "volume_id": volume_id, + "volume_name": name, + "protection_reason": reason, + }, + ) + return True, reason + + # Protection by persistent billing tag + billing_tag = tags_dict.get("iit-billing-tag", "") + if billing_tag in PERSISTENT_TAGS: + reason = f"Persistent billing tag '{billing_tag}'" + if volume_id: + logger.info( + "Volume protected", + extra={ + "volume_id": volume_id, + "volume_name": name, + "protection_reason": reason, + "billing_tag": billing_tag, + }, + ) + return True, reason + + # Protected if has valid billing tag (category or non-expired timestamp) + if has_valid_billing_tag(tags_dict): + reason = f"Valid billing tag '{billing_tag}'" + if volume_id: + logger.info( + "Volume protected", + extra={ + "volume_id": volume_id, + "volume_name": name, + "protection_reason": reason, + "billing_tag": billing_tag, + }, + ) + return True, reason + + return False, "" + + +def check_unattached_volume( + volume: dict[str, Any], tags_dict: dict[str, str], current_time: int +) -> CleanupAction | None: + """ + Check if volume is unattached and eligible for deletion. + + Updated logic to include untagged volumes: + - Must be in "available" state (unattached) + - Must not be protected (by tags, Name, billing tags, etc.) + - Untagged volumes are candidates for deletion + """ + # Must be available (unattached) + if volume["State"] != "available": + return None + + volume_id = volume["VolumeId"] + + # Check protection (includes Name tag, PerconaKeep, billing tags) + is_protected_flag, _ = is_volume_protected(tags_dict, volume_id) + if is_protected_flag: + return None + + # Calculate age + create_time = volume.get("CreateTime") + if not create_time: + logger.warning( + "Volume missing CreateTime, skipping", + extra={"volume_id": volume["VolumeId"]}, + ) + return None + + create_timestamp = int(create_time.timestamp()) + age_seconds = current_time - create_timestamp + age_days = age_seconds / 86400 + + name = tags_dict.get("Name", "") + billing_tag = tags_dict.get("iit-billing-tag", "") + size_gb = volume.get("Size", 0) + volume_type = volume.get("VolumeType", "unknown") + + reason = ( + f"Unattached volume ({size_gb}GB {volume_type}, " + f"created {create_time.strftime('%Y-%m-%d %H:%M:%S UTC')}, " + f"{age_days:.1f} days old)" + ) + + return CleanupAction( + instance_id="", # Empty for volumes + region="", # Set by caller + name=name, + action="DELETE_VOLUME", + reason=reason, + days_overdue=age_days, + billing_tag=billing_tag, + resource_type="volume", + volume_id=volume_id, + ) + + +def delete_volume(action: CleanupAction, region: str) -> bool: + """ + Delete an EBS volume. 
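+    Re-checks that the volume is still available and unprotected immediately
+    before deleting; in DRY_RUN mode the deletion is only logged.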
+ + Args: + action: CleanupAction with volume_id + region: AWS region + + Returns: + True if successful (or DRY_RUN), False otherwise + """ + if not action.volume_id: + logger.error( + "Cannot delete volume: missing volume_id", extra={"action": str(action)} + ) + return False + + try: + ec2 = boto3.client("ec2", region_name=region) + + if DRY_RUN: + logger.info( + "Would DELETE volume", + extra={ + "dry_run": True, + "volume_id": action.volume_id, + "region": region, + "reason": action.reason, + }, + ) + return True + + # Final safety check: verify volume is still available and not protected + volumes = ec2.describe_volumes(VolumeIds=[action.volume_id])["Volumes"] + if not volumes: + logger.warning( + "Volume not found, skipping deletion", + extra={"volume_id": action.volume_id}, + ) + return False + + volume = volumes[0] + if volume["State"] != "available": + logger.warning( + "Volume no longer available, skipping deletion", + extra={"volume_id": action.volume_id, "volume_state": volume["State"]}, + ) + return False + + # Re-check protection (safety) + tags_dict = convert_tags_to_dict(volume.get("Tags", [])) + is_protected_flag, _ = is_volume_protected(tags_dict) + if is_protected_flag: + logger.warning( + "Volume now protected, skipping deletion", + extra={"volume_id": action.volume_id}, + ) + return False + + # Delete volume + ec2.delete_volume(VolumeId=action.volume_id) + logger.info( + "DELETE volume", + extra={ + "volume_id": action.volume_id, + "region": region, + "reason": action.reason, + }, + ) + return True + + except ClientError as e: + error_code = e.response.get("Error", {}).get("Code", "Unknown") + error_msg = e.response.get("Error", {}).get("Message", str(e)) + + if error_code == "InvalidVolume.NotFound": + logger.warning( + "Volume not found (already deleted?)", + extra={ + "volume_id": action.volume_id, + "error_code": error_code, + "error_message": error_msg, + }, + ) + return False + elif error_code == "VolumeInUse": + logger.warning( + "Volume in use, cannot delete", + extra={ + "volume_id": action.volume_id, + "error_code": error_code, + "error_message": error_msg, + }, + ) + return False + else: + logger.error( + "Failed to delete volume", + extra={ + "volume_id": action.volume_id, + "error_code": error_code, + "error_message": error_msg, + }, + ) + return False + + except Exception as e: + logger.error( + "Unexpected error deleting volume", + extra={"volume_id": action.volume_id, "error": str(e)}, + ) + return False diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/eks/__init__.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/eks/__init__.py new file mode 100644 index 0000000000..40efa2eb42 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/eks/__init__.py @@ -0,0 +1,13 @@ +"""EKS cluster cleanup via CloudFormation.""" + +from .cloudformation import ( + get_eks_cloudformation_billing_tag, + cleanup_failed_stack_resources, + delete_eks_cluster_stack, +) + +__all__ = [ + "get_eks_cloudformation_billing_tag", + "cleanup_failed_stack_resources", + "delete_eks_cluster_stack", +] diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/eks/cloudformation.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/eks/cloudformation.py new file mode 100644 index 0000000000..ab4d29191c --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/eks/cloudformation.py @@ -0,0 +1,184 @@ +"""EKS CloudFormation stack operations.""" + +from __future__ import annotations +import boto3 
+from botocore.exceptions import ClientError +from ..models.config import DRY_RUN +from ..utils import get_logger + +logger = get_logger() + + +def get_eks_cloudformation_billing_tag(cluster_name: str, region: str) -> str | None: + """Check CloudFormation stack for iit-billing-tag.""" + try: + cfn = boto3.client("cloudformation", region_name=region) + stack_name = f"eksctl-{cluster_name}-cluster" + + response = cfn.describe_stacks(StackName=stack_name) + stack_tags = { + tag["Key"]: tag["Value"] for tag in response["Stacks"][0].get("Tags", []) + } + + return stack_tags.get("iit-billing-tag") + except ClientError as e: + if "does not exist" in str(e): + logger.warning(f"CloudFormation stack {stack_name} not found in {region}") + return None + logger.error(f"Error checking CloudFormation stack tags: {e}") + return None + except Exception as e: + logger.error(f"Unexpected error checking CloudFormation stack: {e}") + return None + + +def cleanup_failed_stack_resources(stack_name: str, region: str) -> bool: + """Manually clean up resources that prevent stack deletion.""" + try: + cfn = boto3.client("cloudformation", region_name=region) + ec2 = boto3.client("ec2", region_name=region) + + # Get failed resources from stack events + events = cfn.describe_stack_events(StackName=stack_name) + failed_resources = {} + + for event in events["StackEvents"]: + if event.get("ResourceStatus") == "DELETE_FAILED": + logical_id = event["LogicalResourceId"] + if logical_id not in failed_resources: + failed_resources[logical_id] = { + "Type": event["ResourceType"], + "PhysicalId": event.get("PhysicalResourceId"), + } + + if not failed_resources: + return True + + logger.info( + f"Attempting to clean up {len(failed_resources)} failed resources " + f"for stack {stack_name}" + ) + + # Process each failed resource type + for logical_id, resource in failed_resources.items(): + resource_type = resource["Type"] + physical_id = resource["PhysicalId"] + + try: + # Clean up security group ingress rules + if resource_type == "AWS::EC2::SecurityGroupIngress" and physical_id: + sg_id = physical_id.split("|")[0] if "|" in physical_id else None + if sg_id and sg_id.startswith("sg-"): + response = ec2.describe_security_groups(GroupIds=[sg_id]) + if response["SecurityGroups"]: + sg = response["SecurityGroups"][0] + if sg["IpPermissions"]: + ec2.revoke_security_group_ingress( + GroupId=sg_id, IpPermissions=sg["IpPermissions"] + ) + logger.info(f"Cleaned up ingress rules for {sg_id}") + + # Clean up route table associations + elif ( + resource_type == "AWS::EC2::SubnetRouteTableAssociation" + and physical_id + ): + if physical_id.startswith("rtbassoc-"): + ec2.disassociate_route_table(AssociationId=physical_id) + logger.info(f"Disassociated route table {physical_id}") + + # Clean up routes + elif resource_type == "AWS::EC2::Route" and physical_id: + parts = physical_id.split("_") + if len(parts) == 2 and parts[0].startswith("rtb-"): + rtb_id = parts[0] + dest_cidr = parts[1] + ec2.delete_route( + RouteTableId=rtb_id, DestinationCidrBlock=dest_cidr + ) + logger.info(f"Deleted route {dest_cidr} from {rtb_id}") + + except ClientError as e: + error_code = e.response.get("Error", {}).get("Code", "") + if error_code not in [ + "InvalidGroup.NotFound", + "InvalidAssociationID.NotFound", + "InvalidRoute.NotFound", + ]: + logger.warning( + f"Could not clean up {resource_type} {physical_id}: {e}" + ) + except Exception as e: + logger.warning( + f"Unexpected error cleaning up {resource_type} {physical_id}: {e}" + ) + + return True + + except 
Exception as e: + logger.error(f"Error cleaning up failed resources for stack {stack_name}: {e}") + return False + + +def delete_eks_cluster_stack(cluster_name: str, region: str) -> bool: + """Delete EKS cluster by removing its CloudFormation stack.""" + try: + cfn = boto3.client("cloudformation", region_name=region) + stack_name = f"eksctl-{cluster_name}-cluster" + + # Check if stack exists and its current status + try: + response = cfn.describe_stacks(StackName=stack_name) + stack_status = response["Stacks"][0]["StackStatus"] + except ClientError as e: + if "does not exist" in str(e): + logger.warning( + f"CloudFormation stack {stack_name} not found in {region}" + ) + return False + raise + + # Handle DELETE_FAILED status - retry after cleanup + if stack_status == "DELETE_FAILED": + if DRY_RUN: + logger.info( + f"[DRY-RUN] Would DELETE cloudformation_stack {stack_name} (retry after cleanup) in {region}" + ) + else: + logger.info( + f"Stack {stack_name} previously failed deletion, attempting cleanup and retry" + ) + cleanup_failed_stack_resources(stack_name, region) + cfn.delete_stack(StackName=stack_name) + logger.info( + f"DELETE cloudformation_stack {stack_name} (retrying after cleanup) in {region}" + ) + return True + + # Handle already deleting + if "DELETE" in stack_status and stack_status != "DELETE_COMPLETE": + logger.info(f"Stack {stack_name} already deleting (status: {stack_status})") + return True + + # Initiate deletion for new stacks + if DRY_RUN: + logger.info( + f"[DRY-RUN] Would DELETE cloudformation_stack {stack_name} for cluster {cluster_name} in {region}" + ) + else: + cfn.delete_stack(StackName=stack_name) + logger.info( + f"DELETE cloudformation_stack {stack_name} for cluster {cluster_name} in {region}" + ) + return True + + except ClientError as e: + logger.error( + f"Failed to delete CloudFormation stack for cluster {cluster_name} in {region}: {e}" + ) + return False + except Exception as e: + logger.error( + f"Unexpected error deleting cluster {cluster_name} in {region}: {e}" + ) + return False diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/handler.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/handler.py new file mode 100644 index 0000000000..838eeebefa --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/handler.py @@ -0,0 +1,383 @@ +"""Main Lambda handler for AWS resources cleanup.""" + +from __future__ import annotations +import json +import time +import datetime +import boto3 +from typing import Any + +from aws_lambda_powertools import Tracer, Metrics +from aws_lambda_powertools.metrics import MetricUnit +from aws_lambda_powertools.utilities.typing import LambdaContext + +from .models import CleanupAction +from .models.config import ( + DRY_RUN, + SNS_TOPIC_ARN, + VOLUME_CLEANUP_ENABLED, + TARGET_REGIONS, +) +from .utils import convert_tags_to_dict, get_logger +from .ec2 import ( + cirrus_ci_add_iit_billing_tag, + is_protected, + execute_cleanup_action, + check_ttl_expiration, + check_stop_after_days, + check_long_stopped, + check_untagged, + check_unattached_volume, + delete_volume, +) + +logger = get_logger() +tracer = Tracer(service="aws-resource-cleanup") +metrics = Metrics(namespace="Percona/ResourceCleanup", service="aws-resource-cleanup") + + +def send_notification(actions: list[CleanupAction], region: str) -> None: + """Send SNS notification about cleanup actions.""" + if not SNS_TOPIC_ARN or not actions: + return + + try: + sns = boto3.client("sns") + + message_lines = [ + f"AWS 
Resources Cleanup Report - {region}", + f"Mode: {'DRY-RUN' if DRY_RUN else 'LIVE'}", + f"Timestamp: {datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')}", + "", + f"Total Actions: {len(actions)}", + "", + ] + + for action in actions: + # Handle volumes differently from instances + if action.resource_type == "volume": + message_lines.append(f"Volume: {action.volume_id}") + else: + message_lines.append(f"Instance: {action.instance_id}") + + message_lines.append(f" Name: {action.name}") + message_lines.append(f" Action: {action.action}") + message_lines.append(f" Days Overdue: {action.days_overdue:.2f}") + message_lines.append(f" Reason: {action.reason}") + message_lines.append(f" Billing Tag: {action.billing_tag}") + if action.owner: + message_lines.append(f" Owner: {action.owner}") + if action.cluster_name: + message_lines.append(f" Cluster: {action.cluster_name}") + message_lines.append("") + + message = "\n".join(message_lines) + subject = f"[{'DRY-RUN' if DRY_RUN else 'LIVE'}] AWS Resources Cleanup: {len(actions)} actions in {region}" + + sns.publish( + TopicArn=SNS_TOPIC_ARN, + Subject=subject[:100], # SNS subject limit + Message=message, + ) + + logger.info( + "Sent SNS notification", + extra={"actions_count": len(actions), "region": region}, + ) + + except Exception as e: + logger.error(f"Failed to send SNS notification: {e}") + + +@tracer.capture_method +def cleanup_region(region: str) -> list[CleanupAction]: + """Process cleanup for a single region.""" + start_time = time.time() + logger.info("Processing region", extra={"region": region}) + + ec2 = boto3.client("ec2", region_name=region) + current_time = int(time.time()) + actions = [] + + # Track instance scan statistics + instance_scan_count = 0 + instance_protected_count = 0 + instance_protection_reasons: dict[str, int] = {} + + try: + response = ec2.describe_instances( + Filters=[{"Name": "instance-state-name", "Values": ["running", "stopped"]}] + ) + + for reservation in response["Reservations"]: + for instance in reservation["Instances"]: + instance_scan_count += 1 + tags_dict = convert_tags_to_dict(instance.get("Tags", [])) + + # Auto-tag CirrusCI instances (existing functionality) + cirrus_ci_add_iit_billing_tag(instance, tags_dict) + + # Skip protected resources + is_protected_flag, protection_reason = is_protected( + tags_dict, instance["InstanceId"] + ) + if is_protected_flag: + instance_protected_count += 1 + instance_protection_reasons[protection_reason] = ( + instance_protection_reasons.get(protection_reason, 0) + 1 + ) + continue + + # Check all cleanup policies (priority order) + action = ( + check_ttl_expiration(instance, tags_dict, current_time) + or check_stop_after_days(instance, tags_dict, current_time) + or check_long_stopped(instance, tags_dict, current_time) + or check_untagged(instance, tags_dict, current_time) + ) + + if action: + action.region = region + actions.append(action) + + # Log instance scan summary and emit metrics + logger.info( + "Instance scan complete", + extra={ + "region": region, + "instances_scanned": instance_scan_count, + "actions_count": len(actions), + "instances_protected": instance_protected_count, + "protection_reasons": instance_protection_reasons, + }, + ) + + # Emit instance metrics with region dimension + metrics.add_dimension(name="Region", value=region) + metrics.add_metric( + name="InstancesScanned", unit=MetricUnit.Count, value=instance_scan_count + ) + metrics.add_metric( + name="InstancesProtected", + unit=MetricUnit.Count, + 
value=instance_protected_count, + ) + metrics.add_metric( + name="InstanceActions", unit=MetricUnit.Count, value=len(actions) + ) + + # Execute instance actions + for action in actions: + execute_cleanup_action(action, region) + + # Volume cleanup phase (after instance cleanup) + volume_actions = [] + volume_scan_count = 0 + volume_protected_count = 0 + volume_protection_reasons: dict[str, int] = {} + + if not VOLUME_CLEANUP_ENABLED: + logger.info("Volume cleanup disabled", extra={"region": region}) + else: + try: + # Query all available (unattached) volumes + # Note: Removed legacy Name tag filter to catch untagged volumes + volumes_response = ec2.describe_volumes( + Filters=[{"Name": "status", "Values": ["available"]}] + ) + + for volume in volumes_response["Volumes"]: + volume_scan_count += 1 + tags_dict = convert_tags_to_dict(volume.get("Tags", [])) + + # Check protection first to track statistics + from .ec2.volumes import is_volume_protected + + is_protected_flag, protection_reason = is_volume_protected( + tags_dict, volume["VolumeId"] + ) + if is_protected_flag: + volume_protected_count += 1 + volume_protection_reasons[protection_reason] = ( + volume_protection_reasons.get(protection_reason, 0) + 1 + ) + continue + + # Check if volume should be deleted + volume_action = check_unattached_volume( + volume, tags_dict, current_time + ) + + if volume_action: + volume_action.region = region + volume_actions.append(volume_action) + + # Log volume scan summary and emit metrics + logger.info( + "Volume scan complete", + extra={ + "region": region, + "volumes_scanned": volume_scan_count, + "actions_count": len(volume_actions), + "volumes_protected": volume_protected_count, + "protection_reasons": volume_protection_reasons, + }, + ) + + # Emit volume metrics (region dimension already set) + metrics.add_metric( + name="VolumesScanned", + unit=MetricUnit.Count, + value=volume_scan_count, + ) + metrics.add_metric( + name="VolumesProtected", + unit=MetricUnit.Count, + value=volume_protected_count, + ) + metrics.add_metric( + name="VolumeActions", + unit=MetricUnit.Count, + value=len(volume_actions), + ) + + # Execute volume deletions + for volume_action in volume_actions: + delete_volume(volume_action, region) + + except Exception as vol_error: + logger.error(f"Error during volume cleanup in {region}: {vol_error}") + + # Combine all actions for notification + all_actions = actions + volume_actions + + # Send notification + if all_actions: + send_notification(all_actions, region) + + # Region completion with timing + duration = time.time() - start_time + logger.info( + "Region cleanup complete", + extra={ + "region": region, + "duration_seconds": round(duration, 1), + "instances_scanned": instance_scan_count, + "instance_actions": len(actions), + "volumes_scanned": volume_scan_count, + "volume_actions": len(volume_actions), + }, + ) + + except Exception as e: + logger.error(f"Error processing region {region}: {e}") + + return actions + volume_actions if "volume_actions" in locals() else actions + + +@logger.inject_lambda_context +@tracer.capture_lambda_handler +@metrics.log_metrics(capture_cold_start_metric=True) +def lambda_handler(event: dict[str, Any], context: LambdaContext) -> dict[str, Any]: + """Main Lambda handler.""" + start_time = time.time() + logger.info("Starting AWS resources cleanup", extra={"dry_run": DRY_RUN}) + + try: + ec2 = boto3.client("ec2") + all_regions = [ + region["RegionName"] for region in ec2.describe_regions()["Regions"] + ] + + # Filter regions based on TARGET_REGIONS 
parameter + if TARGET_REGIONS and TARGET_REGIONS.lower() != "all": + target_list = [r.strip() for r in TARGET_REGIONS.split(",") if r.strip()] + regions = [r for r in all_regions if r in target_list] + logger.info( + "Filtering to specific regions", + extra={"regions": regions, "regions_count": len(regions)}, + ) + else: + regions = all_regions + logger.info("Processing all regions", extra={"regions_count": len(regions)}) + + all_actions = [] + + for region in regions: + region_actions = cleanup_region(region) + all_actions.extend(region_actions) + + # Calculate summary statistics + total_duration = time.time() - start_time + action_counts: dict[str, int] = {} + volume_ages = [] + + for action in all_actions: + action_counts[action.action] = action_counts.get(action.action, 0) + 1 + # Collect volume ages for statistics + if action.action == "DELETE_VOLUME": + volume_ages.append(action.days_overdue) + + # Enhanced summary with volume statistics + summary_extra = { + "total_actions": len(all_actions), + "regions_count": len(regions), + "duration_seconds": round(total_duration, 1), + "actions_by_type": action_counts, + } + + if volume_ages: + summary_extra["volume_age_stats"] = { + "min_days": round(min(volume_ages), 1), + "max_days": round(max(volume_ages), 1), + "avg_days": round(sum(volume_ages) / len(volume_ages), 1), + } + + logger.info("Cleanup complete", extra=summary_extra) + + # Emit summary metrics (no region dimension for totals) + metrics.add_metric( + name="TotalActions", unit=MetricUnit.Count, value=len(all_actions) + ) + metrics.add_metric( + name="RegionsProcessed", unit=MetricUnit.Count, value=len(regions) + ) + metrics.add_metric( + name="ExecutionDuration", unit=MetricUnit.Seconds, value=total_duration + ) + + # Emit metrics per action type + for action_type, count in action_counts.items(): + metrics.add_metric( + name=f"Actions_{action_type}", unit=MetricUnit.Count, value=count + ) + + # Emit volume age statistics if available + if volume_ages: + metrics.add_metric( + name="VolumeAge_Min", unit=MetricUnit.Count, value=min(volume_ages) + ) + metrics.add_metric( + name="VolumeAge_Max", unit=MetricUnit.Count, value=max(volume_ages) + ) + metrics.add_metric( + name="VolumeAge_Avg", + unit=MetricUnit.Count, + value=sum(volume_ages) / len(volume_ages), + ) + + return { + "statusCode": 200, + "body": json.dumps( + { + "dry_run": DRY_RUN, + "total_actions": len(all_actions), + "by_action": action_counts, + "actions": [action.to_dict() for action in all_actions], + } + ), + } + + except Exception as e: + logger.error(f"Lambda execution failed: {e}") + raise diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/__init__.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/__init__.py new file mode 100644 index 0000000000..d2fd12e016 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/__init__.py @@ -0,0 +1,6 @@ +"""Data models for EC2 cleanup Lambda.""" + +from .cleanup_action import CleanupAction +from .config import Config + +__all__ = ["CleanupAction", "Config"] diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/cleanup_action.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/cleanup_action.py new file mode 100644 index 0000000000..615f8287f5 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/cleanup_action.py @@ -0,0 +1,28 @@ +"""CleanupAction data class.""" + +from __future__ import annotations +from dataclasses import 
dataclass, asdict +from typing import Any + + +@dataclass +class CleanupAction: + """Represents a cleanup action to be taken on AWS resources (instances, volumes, etc).""" + + instance_id: str # For instances; empty string for volumes + region: str + name: str + action: str # TERMINATE, STOP, DELETE_VOLUME, TERMINATE_CLUSTER, TERMINATE_OPENSHIFT_CLUSTER + reason: str + days_overdue: float + billing_tag: str = "" + cluster_name: str | None = None + owner: str | None = None + resource_type: str = "instance" # "instance" or "volume" + volume_id: str | None = None # For volumes; None for instances + + def to_dict(self) -> dict[str, Any]: + """Convert to dictionary for JSON serialization.""" + data = asdict(self) + data["days_overdue"] = round(self.days_overdue, 2) + return data diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/config.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/config.py new file mode 100644 index 0000000000..d22831a9f3 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/models/config.py @@ -0,0 +1,63 @@ +"""Configuration from environment variables.""" + +import os + +# Configuration from environment variables +DRY_RUN = os.environ.get("DRY_RUN", "true").lower() == "true" +SNS_TOPIC_ARN = os.environ.get("SNS_TOPIC_ARN", "") + +# EC2 instance cleanup thresholds +UNTAGGED_THRESHOLD_MINUTES = int(os.environ.get("UNTAGGED_THRESHOLD_MINUTES", "30")) +STOPPED_THRESHOLD_DAYS = int(os.environ.get("STOPPED_THRESHOLD_DAYS", "30")) + +# EKS cleanup configuration +EKS_CLEANUP_ENABLED = os.environ.get("EKS_CLEANUP_ENABLED", "true").lower() == "true" +EKS_SKIP_PATTERN = os.environ.get("EKS_SKIP_PATTERN", "pe-.*") + +# OpenShift cleanup configuration +OPENSHIFT_CLEANUP_ENABLED = ( + os.environ.get("OPENSHIFT_CLEANUP_ENABLED", "true").lower() == "true" +) +OPENSHIFT_BASE_DOMAIN = os.environ.get("OPENSHIFT_BASE_DOMAIN", "cd.percona.com") + +# Volume cleanup configuration +VOLUME_CLEANUP_ENABLED = ( + os.environ.get("VOLUME_CLEANUP_ENABLED", "true").lower() == "true" +) + +# Region filtering +TARGET_REGIONS = os.environ.get("TARGET_REGIONS", "all") + +# Logging configuration +LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper() + +# Persistent billing tags (never auto-delete) +PERSISTENT_TAGS = { + "jenkins-cloud", + "jenkins-fb", + "jenkins-pg", + "jenkins-ps3", + "jenkins-ps57", + "jenkins-ps80", + "jenkins-psmdb", + "jenkins-pxb", + "jenkins-pxc", + "jenkins-rel", + "pmm-dev", +} + + +class Config: + """Configuration singleton.""" + + def __init__(self): + self.dry_run = DRY_RUN + self.sns_topic_arn = SNS_TOPIC_ARN + self.untagged_threshold_minutes = UNTAGGED_THRESHOLD_MINUTES + self.stopped_threshold_days = STOPPED_THRESHOLD_DAYS + self.eks_cleanup_enabled = EKS_CLEANUP_ENABLED + self.eks_skip_pattern = EKS_SKIP_PATTERN + self.openshift_cleanup_enabled = OPENSHIFT_CLEANUP_ENABLED + self.openshift_base_domain = OPENSHIFT_BASE_DOMAIN + self.persistent_tags = PERSISTENT_TAGS + self.volume_cleanup_enabled = VOLUME_CLEANUP_ENABLED diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/__init__.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/__init__.py new file mode 100644 index 0000000000..ed6a832b25 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/__init__.py @@ -0,0 +1,35 @@ +"""OpenShift cluster comprehensive cleanup.""" + +from .detection import detect_openshift_infra_id +from .compute import delete_load_balancers +from 
.network import ( + delete_nat_gateways, + release_elastic_ips, + cleanup_network_interfaces, + delete_vpc_endpoints, + delete_security_groups, + delete_subnets, + delete_route_tables, + delete_internet_gateway, + delete_vpc, +) +from .dns import cleanup_route53_records +from .storage import cleanup_s3_state +from .orchestrator import destroy_openshift_cluster + +__all__ = [ + "detect_openshift_infra_id", + "delete_load_balancers", + "delete_nat_gateways", + "release_elastic_ips", + "cleanup_network_interfaces", + "delete_vpc_endpoints", + "delete_security_groups", + "delete_subnets", + "delete_route_tables", + "delete_internet_gateway", + "delete_vpc", + "cleanup_route53_records", + "cleanup_s3_state", + "destroy_openshift_cluster", +] diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/compute.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/compute.py new file mode 100644 index 0000000000..e6d29beafd --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/compute.py @@ -0,0 +1,58 @@ +"""OpenShift compute resources (EC2, Load Balancers).""" + +import boto3 +from ..models.config import DRY_RUN +from ..utils import get_logger + +logger = get_logger() + + +def delete_load_balancers(infra_id: str, region: str): + """Delete Classic ELBs and ALB/NLBs for OpenShift cluster.""" + try: + elb = boto3.client("elb", region_name=region) + elbv2 = boto3.client("elbv2", region_name=region) + ec2 = boto3.client("ec2", region_name=region) + + # Get VPC ID for cluster + vpcs = ec2.describe_vpcs( + Filters=[ + {"Name": "tag:kubernetes.io/cluster/" + infra_id, "Values": ["owned"]} + ] + )["Vpcs"] + vpc_id = vpcs[0]["VpcId"] if vpcs else None + + # Delete Classic ELBs + classic_elbs = elb.describe_load_balancers().get("LoadBalancerDescriptions", []) + for lb in classic_elbs: + if infra_id in lb["LoadBalancerName"] or ( + vpc_id and lb.get("VPCId") == vpc_id + ): + if DRY_RUN: + logger.info( + f"[DRY-RUN] Would DELETE load_balancer {lb['LoadBalancerName']} for cluster {infra_id}" + ) + else: + elb.delete_load_balancer(LoadBalancerName=lb["LoadBalancerName"]) + logger.info( + f"DELETE load_balancer {lb['LoadBalancerName']} for cluster {infra_id}" + ) + + # Delete ALB/NLBs + alb_nlbs = elbv2.describe_load_balancers().get("LoadBalancers", []) + for lb in alb_nlbs: + if infra_id in lb["LoadBalancerName"] or ( + vpc_id and lb.get("VpcId") == vpc_id + ): + if DRY_RUN: + logger.info( + f"[DRY-RUN] Would DELETE load_balancer {lb['LoadBalancerName']} for cluster {infra_id}" + ) + else: + elbv2.delete_load_balancer(LoadBalancerArn=lb["LoadBalancerArn"]) + logger.info( + f"DELETE load_balancer {lb['LoadBalancerName']} for cluster {infra_id}" + ) + + except Exception as e: + logger.error(f"Error deleting load balancers: {e}") diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/detection.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/detection.py new file mode 100644 index 0000000000..988a24cfde --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/detection.py @@ -0,0 +1,45 @@ +"""OpenShift cluster detection.""" + +from __future__ import annotations +import boto3 +from ..utils import get_logger + +logger = get_logger() + + +def detect_openshift_infra_id(cluster_name: str, region: str) -> str | None: + """Detect OpenShift infrastructure ID from cluster name.""" + try: + ec2 = boto3.client("ec2", region_name=region) + + # Try exact match first + vpcs = 
ec2.describe_vpcs( + Filters=[ + {"Name": "tag-key", "Values": [f"kubernetes.io/cluster/{cluster_name}"]} + ] + )["Vpcs"] + + # Try wildcard match if exact doesn't work + if not vpcs: + vpcs = ec2.describe_vpcs( + Filters=[ + { + "Name": "tag-key", + "Values": [f"kubernetes.io/cluster/{cluster_name}-*"], + } + ] + )["Vpcs"] + + if vpcs: + for tag in vpcs[0].get("Tags", []): + if tag["Key"].startswith("kubernetes.io/cluster/"): + infra_id: str = tag["Key"].split("/")[-1] + logger.info( + f"Detected OpenShift infra ID: {infra_id} from cluster: {cluster_name}" + ) + return infra_id + + except Exception as e: + logger.error(f"Error detecting OpenShift infra ID: {e}") + + return None diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/dns.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/dns.py new file mode 100644 index 0000000000..19e5dc14f3 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/dns.py @@ -0,0 +1,54 @@ +"""OpenShift Route53 DNS cleanup.""" + +import boto3 +from ..models.config import DRY_RUN, OPENSHIFT_BASE_DOMAIN +from ..utils import get_logger + +logger = get_logger() + + +def cleanup_route53_records(cluster_name: str, region: str): + """Clean up Route53 DNS records for OpenShift cluster.""" + try: + route53 = boto3.client("route53") + + # Find the hosted zone for the base domain + zones = route53.list_hosted_zones()["HostedZones"] + zone_id = None + for zone in zones: + if zone["Name"].rstrip(".") == OPENSHIFT_BASE_DOMAIN: + zone_id = zone["Id"].split("/")[-1] + break + + if not zone_id: + logger.warning(f"Hosted zone for {OPENSHIFT_BASE_DOMAIN} not found") + return + + # Get all DNS records for this zone + records = route53.list_resource_record_sets(HostedZoneId=zone_id)[ + "ResourceRecordSets" + ] + + # Find records for this cluster + changes = [] + for record in records: + name = record["Name"].rstrip(".") + # Match api.cluster.domain or *.apps.cluster.domain + if ( + f"api.{cluster_name}.{OPENSHIFT_BASE_DOMAIN}" in name + or f"apps.{cluster_name}.{OPENSHIFT_BASE_DOMAIN}" in name + ): + changes.append({"Action": "DELETE", "ResourceRecordSet": record}) + + if changes and not DRY_RUN: + route53.change_resource_record_sets( + HostedZoneId=zone_id, ChangeBatch={"Changes": changes} + ) + logger.info(f"Deleted {len(changes)} Route53 records for {cluster_name}") + elif changes: + logger.info( + f"[DRY-RUN] Would delete {len(changes)} Route53 records for {cluster_name}" + ) + + except Exception as e: + logger.error(f"Error cleaning up Route53 records: {e}") diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/network.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/network.py new file mode 100644 index 0000000000..6ba27552ed --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/network.py @@ -0,0 +1,358 @@ +"""OpenShift network resources cleanup.""" + +import boto3 +from botocore.exceptions import ClientError +from ..models.config import DRY_RUN +from ..utils import get_logger + +logger = get_logger() + + +def delete_nat_gateways(infra_id: str, region: str): + """Delete NAT gateways for OpenShift cluster.""" + try: + ec2 = boto3.client("ec2", region_name=region) + nat_gws = ec2.describe_nat_gateways( + Filters=[ + {"Name": "tag:kubernetes.io/cluster/" + infra_id, "Values": ["owned"]}, + {"Name": "state", "Values": ["available", "pending"]}, + ] + )["NatGateways"] + + for nat in nat_gws: + if DRY_RUN: + 
logger.info( + "Would DELETE nat_gateway", + extra={ + "dry_run": True, + "nat_gateway_id": nat["NatGatewayId"], + "infra_id": infra_id, + }, + ) + else: + ec2.delete_nat_gateway(NatGatewayId=nat["NatGatewayId"]) + logger.info( + "DELETE nat_gateway", + extra={"nat_gateway_id": nat["NatGatewayId"], "infra_id": infra_id}, + ) + + except Exception as e: + logger.error("Error deleting NAT gateways", extra={"error": str(e)}) + + +def release_elastic_ips(infra_id: str, region: str): + """Release Elastic IPs for OpenShift cluster.""" + try: + ec2 = boto3.client("ec2", region_name=region) + eips = ec2.describe_addresses( + Filters=[ + {"Name": "tag:kubernetes.io/cluster/" + infra_id, "Values": ["owned"]} + ] + )["Addresses"] + + for eip in eips: + if "AllocationId" in eip: + if DRY_RUN: + logger.info( + "Would DELETE elastic_ip", + extra={ + "dry_run": True, + "allocation_id": eip["AllocationId"], + "infra_id": infra_id, + }, + ) + else: + try: + ec2.release_address(AllocationId=eip["AllocationId"]) + logger.info( + "DELETE elastic_ip", + extra={ + "allocation_id": eip["AllocationId"], + "infra_id": infra_id, + }, + ) + except ClientError: + pass # May already be released + + except Exception as e: + logger.error("Error releasing EIPs", extra={"error": str(e)}) + + +def cleanup_network_interfaces(vpc_id: str, region: str): + """Clean up orphaned network interfaces.""" + try: + ec2 = boto3.client("ec2", region_name=region) + enis = ec2.describe_network_interfaces( + Filters=[ + {"Name": "vpc-id", "Values": [vpc_id]}, + {"Name": "status", "Values": ["available"]}, + ] + )["NetworkInterfaces"] + + for eni in enis: + if DRY_RUN: + logger.info( + "Would DELETE network_interface", + extra={ + "dry_run": True, + "network_interface_id": eni["NetworkInterfaceId"], + "vpc_id": vpc_id, + }, + ) + else: + try: + ec2.delete_network_interface( + NetworkInterfaceId=eni["NetworkInterfaceId"] + ) + logger.info( + "DELETE network_interface", + extra={ + "network_interface_id": eni["NetworkInterfaceId"], + "vpc_id": vpc_id, + }, + ) + except ClientError: + pass # May already be deleted + + except Exception as e: + logger.error("Error cleaning up ENIs", extra={"error": str(e)}) + + +def delete_vpc_endpoints(vpc_id: str, region: str): + """Delete VPC endpoints.""" + try: + ec2 = boto3.client("ec2", region_name=region) + endpoints = ec2.describe_vpc_endpoints( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + )["VpcEndpoints"] + + for endpoint in endpoints: + if DRY_RUN: + logger.info( + "Would DELETE vpc_endpoint", + extra={ + "dry_run": True, + "vpc_endpoint_id": endpoint["VpcEndpointId"], + "vpc_id": vpc_id, + }, + ) + else: + try: + ec2.delete_vpc_endpoints(VpcEndpointIds=[endpoint["VpcEndpointId"]]) + logger.info( + "DELETE vpc_endpoint", + extra={ + "vpc_endpoint_id": endpoint["VpcEndpointId"], + "vpc_id": vpc_id, + }, + ) + except ClientError: + pass + + except Exception as e: + logger.error("Error deleting VPC endpoints", extra={"error": str(e)}) + + +def delete_security_groups(vpc_id: str, region: str): + """Delete security groups with dependency handling.""" + try: + ec2 = boto3.client("ec2", region_name=region) + sgs = ec2.describe_security_groups( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + )["SecurityGroups"] + + # First pass: remove all ingress rules to break circular dependencies + for sg in sgs: + if sg["GroupName"] == "default": + continue + try: + if sg.get("IpPermissions"): + if not DRY_RUN: + ec2.revoke_security_group_ingress( + GroupId=sg["GroupId"], IpPermissions=sg["IpPermissions"] + 
) + except ClientError: + pass + + # Second pass: delete security groups + for sg in sgs: + if sg["GroupName"] == "default": + continue + if DRY_RUN: + logger.info( + "Would DELETE security_group", + extra={ + "dry_run": True, + "security_group_id": sg["GroupId"], + "vpc_id": vpc_id, + }, + ) + else: + try: + ec2.delete_security_group(GroupId=sg["GroupId"]) + logger.info( + "DELETE security_group", + extra={"security_group_id": sg["GroupId"], "vpc_id": vpc_id}, + ) + except ClientError: + pass + + except Exception as e: + logger.error("Error deleting security groups", extra={"error": str(e)}) + + +def delete_subnets(vpc_id: str, region: str): + """Delete subnets.""" + try: + ec2 = boto3.client("ec2", region_name=region) + subnets = ec2.describe_subnets( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + )["Subnets"] + + for subnet in subnets: + if DRY_RUN: + logger.info( + "Would DELETE subnet", + extra={ + "dry_run": True, + "subnet_id": subnet["SubnetId"], + "vpc_id": vpc_id, + }, + ) + else: + try: + ec2.delete_subnet(SubnetId=subnet["SubnetId"]) + logger.info( + "DELETE subnet", + extra={"subnet_id": subnet["SubnetId"], "vpc_id": vpc_id}, + ) + except ClientError: + pass + + except Exception as e: + logger.error("Error deleting subnets", extra={"error": str(e)}) + + +def delete_route_tables(vpc_id: str, region: str): + """Delete route tables.""" + try: + ec2 = boto3.client("ec2", region_name=region) + rts = ec2.describe_route_tables( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + )["RouteTables"] + + for rt in rts: + # Skip main route table + is_main = any( + assoc.get("Main", False) for assoc in rt.get("Associations", []) + ) + if is_main: + continue + + if DRY_RUN: + logger.info( + "Would DELETE route_table", + extra={ + "dry_run": True, + "route_table_id": rt["RouteTableId"], + "vpc_id": vpc_id, + }, + ) + else: + try: + ec2.delete_route_table(RouteTableId=rt["RouteTableId"]) + logger.info( + "DELETE route_table", + extra={"route_table_id": rt["RouteTableId"], "vpc_id": vpc_id}, + ) + except ClientError: + pass + + except Exception as e: + logger.error("Error deleting route tables", extra={"error": str(e)}) + + +def delete_internet_gateway(vpc_id: str, region: str): + """Detach and delete internet gateway.""" + try: + ec2 = boto3.client("ec2", region_name=region) + igws = ec2.describe_internet_gateways( + Filters=[{"Name": "attachment.vpc-id", "Values": [vpc_id]}] + )["InternetGateways"] + + for igw in igws: + if DRY_RUN: + logger.info( + "Would DELETE internet_gateway", + extra={ + "dry_run": True, + "internet_gateway_id": igw["InternetGatewayId"], + "vpc_id": vpc_id, + }, + ) + else: + try: + ec2.detach_internet_gateway( + InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc_id + ) + ec2.delete_internet_gateway( + InternetGatewayId=igw["InternetGatewayId"] + ) + logger.info( + "DELETE internet_gateway", + extra={ + "internet_gateway_id": igw["InternetGatewayId"], + "vpc_id": vpc_id, + }, + ) + except ClientError: + pass + + except Exception as e: + logger.error("Error deleting IGW", extra={"error": str(e)}) + + +def delete_vpc(vpc_id: str, region: str) -> bool: + """ + Delete VPC. 
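+
+    A DependencyViolation from EC2 is expected while dependent resources
+    (ENIs, subnets, endpoints) are still being deleted; callers treat a
+    False return as "retry on the next scheduled run" rather than a failure.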
+ + Returns: + True if VPC was deleted successfully + False if VPC still has dependencies + """ + try: + ec2 = boto3.client("ec2", region_name=region) + if DRY_RUN: + logger.info( + "Would DELETE vpc", + extra={"dry_run": True, "vpc_id": vpc_id, "region": region}, + ) + return True # In DRY_RUN, assume success + else: + try: + ec2.delete_vpc(VpcId=vpc_id) + logger.info("DELETE vpc", extra={"vpc_id": vpc_id, "region": region}) + return True + except ClientError as e: + error_code = e.response.get("Error", {}).get("Code", "") + if error_code == "DependencyViolation": + logger.info( + "VPC still has dependencies, cannot delete yet", + extra={"vpc_id": vpc_id, "error_code": error_code}, + ) + return False + else: + # Other errors (permissions, etc.) should be logged + logger.error( + "Error deleting VPC", + extra={ + "vpc_id": vpc_id, + "error": str(e), + "error_code": error_code, + }, + ) + return False + + except Exception as e: + logger.error("Unexpected error deleting VPC", extra={"error": str(e)}) + return False diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/orchestrator.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/orchestrator.py new file mode 100644 index 0000000000..74f77e4f53 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/orchestrator.py @@ -0,0 +1,136 @@ +"""OpenShift cluster destruction orchestration. + +Single-pass cleanup with dependency order enforcement. +EventBridge schedule (every 15 minutes) handles retries naturally. +""" + +import boto3 +from botocore.exceptions import ClientError +from ..utils import get_logger +from .compute import delete_load_balancers +from .network import ( + delete_nat_gateways, + release_elastic_ips, + cleanup_network_interfaces, + delete_vpc_endpoints, + delete_security_groups, + delete_subnets, + delete_route_tables, + delete_internet_gateway, + delete_vpc, +) +from .dns import cleanup_route53_records +from .storage import cleanup_s3_state + +logger = get_logger() + + +def destroy_openshift_cluster(cluster_name: str, infra_id: str, region: str) -> bool: + """ + Single-pass OpenShift cluster cleanup. + + Deletes resources in dependency order. If resources still have dependencies, + exits gracefully and relies on next EventBridge schedule (15min) to retry. 
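+
+    Deletion order: load balancers → NAT gateways → Elastic IPs → network
+    interfaces → VPC endpoints → security groups → subnets → route tables →
+    internet gateway → VPC. Route53 records and S3 state are cleaned up only
+    once the VPC is gone.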
+ + Returns: + True if VPC successfully deleted (cleanup complete) + False if resources remain (will retry on next schedule) + """ + logger.info( + "Starting OpenShift cluster cleanup", + extra={ + "cluster_name": cluster_name, + "infra_id": infra_id, + "cluster_type": "openshift", + "region": region, + }, + ) + + try: + ec2 = boto3.client("ec2", region_name=region) + + # Check if VPC still exists + vpcs = ec2.describe_vpcs( + Filters=[ + { + "Name": "tag:kubernetes.io/cluster/" + infra_id, + "Values": ["owned"], + } + ] + )["Vpcs"] + + if not vpcs: + logger.info( + "VPC not found - cleanup complete", + extra={"cluster_name": cluster_name, "infra_id": infra_id}, + ) + # Clean up Route53 and S3 when VPC is gone + cleanup_route53_records(cluster_name, region) + cleanup_s3_state(cluster_name, region) + return True + + vpc_id = vpcs[0]["VpcId"] + logger.info( + "Found VPC, proceeding with cleanup", + extra={"cluster_name": cluster_name, "vpc_id": vpc_id}, + ) + + # Delete resources in dependency order + # Each function handles its own DependencyViolation errors gracefully + delete_load_balancers(infra_id, region) + delete_nat_gateways(infra_id, region) + release_elastic_ips(infra_id, region) + cleanup_network_interfaces(vpc_id, region) + delete_vpc_endpoints(vpc_id, region) + delete_security_groups(vpc_id, region) + delete_subnets(vpc_id, region) + delete_route_tables(vpc_id, region) + delete_internet_gateway(vpc_id, region) + + # Try to delete VPC - if it fails due to dependencies, we'll retry on next run + vpc_deleted = delete_vpc(vpc_id, region) + + if vpc_deleted: + logger.info( + "Successfully deleted VPC", + extra={"cluster_name": cluster_name, "vpc_id": vpc_id}, + ) + # Clean up Route53 and S3 when VPC is successfully deleted + cleanup_route53_records(cluster_name, region) + cleanup_s3_state(cluster_name, region) + return True + else: + logger.info( + "VPC still has dependencies, will retry on next schedule", + extra={ + "cluster_name": cluster_name, + "vpc_id": vpc_id, + "retry_interval_minutes": 15, + }, + ) + return False + + except ClientError as e: + error_code = e.response.get("Error", {}).get("Code", "") + if error_code == "DependencyViolation": + logger.info( + "Dependencies remain, will retry on next schedule", + extra={"cluster_name": cluster_name, "error_code": error_code}, + ) + return False + else: + logger.error( + "Error during OpenShift cleanup", + extra={ + "cluster_name": cluster_name, + "error": str(e), + "error_code": error_code, + }, + ) + raise + except Exception as e: + logger.error( + "Unexpected error during OpenShift cleanup", + extra={"cluster_name": cluster_name, "error": str(e)}, + ) + raise diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/storage.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/storage.py new file mode 100644 index 0000000000..11276f5ed8 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/openshift/storage.py @@ -0,0 +1,42 @@ +"""OpenShift S3 state storage cleanup.""" + +import boto3 +from botocore.exceptions import ClientError +from ..models.config import DRY_RUN +from ..utils import get_logger + +logger = get_logger() + + +def cleanup_s3_state(cluster_name: str, region: str): + """Clean up S3 state bucket for OpenShift cluster.""" + try: + s3 = boto3.client("s3", region_name=region) + sts = boto3.client("sts") + + # Determine S3 bucket name (standard naming convention) + account_id = sts.get_caller_identity()["Account"] + bucket_name = 
f"openshift-clusters-{account_id}-{region}" + + try: + # List objects with cluster name prefix + objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=f"{cluster_name}/") + + if "Contents" in objects: + if DRY_RUN: + logger.info( + f"[DRY-RUN] Would delete {len(objects['Contents'])} " + f"S3 objects for {cluster_name}" + ) + else: + for obj in objects["Contents"]: + s3.delete_object(Bucket=bucket_name, Key=obj["Key"]) + logger.info(f"Deleted S3 state for {cluster_name}") + except ClientError as e: + if "NoSuchBucket" in str(e): + logger.info(f"S3 bucket {bucket_name} does not exist") + else: + raise + + except Exception as e: + logger.error(f"Error cleaning up S3 state: {e}") diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/requirements.txt b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/requirements.txt new file mode 100644 index 0000000000..85c16d7a50 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/requirements.txt @@ -0,0 +1,6 @@ +# AWS SDK (included in Lambda runtime, but specified for local development) +boto3>=1.40.53 +botocore>=1.40.53 + +# AWS Lambda Powertools for observability +aws-lambda-powertools[tracer]>=3.3.0 diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/__init__.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/__init__.py new file mode 100644 index 0000000000..dbb464a4eb --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/__init__.py @@ -0,0 +1,15 @@ +"""Utility functions for EC2 cleanup Lambda.""" + +from .aws_helpers import ( + convert_tags_to_dict, + has_valid_billing_tag, + extract_cluster_name, +) +from .logging_config import get_logger + +__all__ = [ + "convert_tags_to_dict", + "has_valid_billing_tag", + "extract_cluster_name", + "get_logger", +] diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/aws_helpers.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/aws_helpers.py new file mode 100644 index 0000000000..77006c0fc7 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/aws_helpers.py @@ -0,0 +1,64 @@ +"""AWS helper functions.""" + +from __future__ import annotations +import datetime +from typing import Any +from .logging_config import get_logger + +logger = get_logger() + + +def convert_tags_to_dict(tags: list[dict[str, str]] | None) -> dict[str, str]: + """Convert AWS tag list to dictionary.""" + return {tag["Key"]: tag["Value"] for tag in tags} if tags else {} + + +def has_valid_billing_tag( + tags_dict: dict[str, str], instance_launch_time: Any = None +) -> bool: + """ + Check if instance has a valid iit-billing-tag. 
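+
+    Examples (illustrative tag values):
+        "pmm-staging"  -> valid category tag
+        "2524608000"   -> valid until that future Unix timestamp passes
+        ""             -> invalid (empty)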
+ + For regular instances: any non-empty value is valid + For timestamp-based tags: check if Unix timestamp is in the future + """ + if "iit-billing-tag" not in tags_dict: + return False + + tag_value = tags_dict["iit-billing-tag"] + + # Empty tag is invalid + if not tag_value: + return False + + # Try to parse as Unix timestamp (for EKS auto-expiration) + try: + expiration_timestamp = int(tag_value) + current_timestamp = int( + datetime.datetime.now(datetime.timezone.utc).timestamp() + ) + + # If it's a valid future timestamp, check if it's expired + if expiration_timestamp > current_timestamp: + return True + else: + logger.debug( + "Billing tag expired", + extra={ + "expiration_timestamp": expiration_timestamp, + "current_timestamp": current_timestamp, + "expired_seconds_ago": current_timestamp - expiration_timestamp, + }, + ) + return False + except ValueError: + # Not a timestamp, treat as category string (e.g., "pmm-staging", "CirrusCI") + return True + + +def extract_cluster_name(tags_dict: dict[str, str]) -> str | None: + """Extract cluster name from kubernetes tags.""" + for key in tags_dict.keys(): + if key.startswith("kubernetes.io/cluster/"): + return key.split("/")[-1] + return tags_dict.get("aws:eks:cluster-name") diff --git a/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/logging_config.py b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/logging_config.py new file mode 100644 index 0000000000..9e359ed2c9 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/lambda/aws_resource_cleanup/utils/logging_config.py @@ -0,0 +1,26 @@ +"""Logging configuration using AWS Lambda Powertools.""" + +import os + +from aws_lambda_powertools import Logger + +# Read log level from environment (default to INFO) +LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper() + +# Set up Powertools logger with service name +# This provides structured logging with automatic Lambda context injection +logger = Logger( + service="aws-resource-cleanup", + level=LOG_LEVEL, +) + + +def get_logger(): + """Get the configured logger instance. + + Returns Powertools Logger with: + - Structured JSON logging + - Automatic Lambda context (request_id, function_name, etc.) 
+ - CloudWatch Logs Insights ready + """ + return logger diff --git a/IaC/cdk/aws-resources-cleanup/mypy.ini b/IaC/cdk/aws-resources-cleanup/mypy.ini new file mode 100644 index 0000000000..131846f2e9 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/mypy.ini @@ -0,0 +1,12 @@ +[mypy] +python_version = 3.12 +warn_return_any = True +warn_unused_configs = True +disallow_untyped_defs = False +ignore_missing_imports = True + +[mypy-boto3.*] +ignore_missing_imports = True + +[mypy-botocore.*] +ignore_missing_imports = True diff --git a/IaC/cdk/aws-resources-cleanup/requirements.txt b/IaC/cdk/aws-resources-cleanup/requirements.txt new file mode 100644 index 0000000000..9734ecb03e --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/requirements.txt @@ -0,0 +1,19 @@ +# CDK core and constructs +aws-cdk-lib>=2.220.0 +constructs>=10.4.2 + +# Python requirements +boto3>=1.40.53 + +# Testing +pytest>=8.4.2 +pytest-cov>=7.0.0 +moto>=5.1.14 + +# Linting and formatting +ruff>=0.14.0 +black>=25.9.0 +mypy>=1.18.2 + +# AWS CLI and CDK CLI +awscli>=1.42.53 diff --git a/IaC/cdk/aws-resources-cleanup/stacks/__init__.py b/IaC/cdk/aws-resources-cleanup/stacks/__init__.py new file mode 100644 index 0000000000..11b6ce5901 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/stacks/__init__.py @@ -0,0 +1,5 @@ +"""CDK stacks for AWS resource cleanup.""" + +from .resource_cleanup_stack import ResourceCleanupStack + +__all__ = ['ResourceCleanupStack'] diff --git a/IaC/cdk/aws-resources-cleanup/stacks/resource_cleanup_stack.py b/IaC/cdk/aws-resources-cleanup/stacks/resource_cleanup_stack.py new file mode 100644 index 0000000000..7bb157fc77 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/stacks/resource_cleanup_stack.py @@ -0,0 +1,347 @@ +"""CDK Stack for AWS Resource Cleanup Lambda.""" + +from aws_cdk import ( + Stack, + Duration, + aws_lambda as lambda_, + aws_iam as iam, + aws_sns as sns, + aws_sns_subscriptions as subscriptions, + aws_events as events, + aws_events_targets as targets, + aws_logs as logs, + aws_cloudwatch as cloudwatch, + aws_cloudwatch_actions as cw_actions, + CfnParameter, + CfnOutput, + Tags +) +from constructs import Construct + + +class ResourceCleanupStack(Stack): + """ + CDK Stack for comprehensive AWS resource cleanup. + + Manages EC2 instances, EBS volumes, EKS clusters, and OpenShift infrastructure with: + - TTL-based policies + - Billing tag validation + - Unattached volume cleanup (available EBS volumes) + - Cluster-aware cleanup (EKS CloudFormation, OpenShift VPC/ELB/Route53/S3) + - Configurable dry-run mode + """ + + def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: + super().__init__(scope, construct_id, **kwargs) + + # Parameters + dry_run_param = CfnParameter( + self, "DryRunMode", + type="String", + default="true", + allowed_values=["true", "false"], + description="[SAFETY] Safe mode - logs all actions without executing them. Set to 'false' only when ready for actual resource deletion. Always test with 'true' first." + ) + + notification_email_param = CfnParameter( + self, "NotificationEmail", + type="String", + default="", + description="[NOTIFICATIONS] Email address for cleanup action reports. Leave empty to disable SNS notifications. Subscribe to SNS topic manually after deployment." + ) + + untagged_threshold_param = CfnParameter( + self, "UntaggedThresholdMinutes", + type="Number", + default=30, + min_value=10, + max_value=1440, + description="[POLICY] Grace period in minutes before terminating instances without iit-billing-tag. Default 30 minutes. 
Range: 10-1440 minutes (24 hours max)." + ) + + stopped_threshold_param = CfnParameter( + self, "StoppedThresholdDays", + type="Number", + default=30, + min_value=7, + max_value=180, + description="[POLICY] Days a stopped instance can remain before termination. Default 30 days. Range: 7-180 days. Helps reduce costs from forgotten stopped instances." + ) + + eks_cleanup_param = CfnParameter( + self, "EKSCleanupEnabled", + type="String", + default="true", + allowed_values=["true", "false"], + description="[EKS] Enable full EKS cluster deletion via CloudFormation stack removal (eksctl-* stacks). When disabled, only terminates EC2 nodes." + ) + + eks_skip_pattern_param = CfnParameter( + self, "EKSSkipPattern", + type="String", + default="pe-.*", + description="[EKS] Regex pattern for cluster names to protect from deletion. Default 'pe-.*' protects production environment clusters. Use '(?!)' to disable protection." + ) + + openshift_cleanup_param = CfnParameter( + self, "OpenShiftCleanupEnabled", + type="String", + default="true", + allowed_values=["true", "false"], + description="[OPENSHIFT] Enable comprehensive OpenShift cluster cleanup including VPC, load balancers, Route53 DNS, and S3 buckets. When disabled, only terminates EC2 nodes." + ) + + openshift_domain_param = CfnParameter( + self, "OpenShiftBaseDomain", + type="String", + default="cd.percona.com", + description="[OPENSHIFT] Base domain for Route53 DNS record cleanup. Only records under this domain will be removed. Must match your OpenShift installation domain." + ) + + volume_cleanup_param = CfnParameter( + self, "VolumeCleanupEnabled", + type="String", + default="true", + allowed_values=["true", "false"], + description="[VOLUMES] Enable cleanup of unattached (available) EBS volumes. Only deletes volumes without protection tags (PerconaKeep, valid billing tags, 'do not remove' in name)." + ) + + # Scheduling + schedule_rate_param = CfnParameter( + self, "ScheduleRateMinutes", + type="Number", + default=15, + description="[SCHEDULING] Execution frequency in minutes. Lambda scans all target regions at this interval. Recommended: 15 for normal use, 5 for aggressive cleanup, 60 for light monitoring." + ) + + # Advanced Cleanup + regions_param = CfnParameter( + self, "TargetRegions", + type="String", + default="all", + description="[REGION FILTER] Target AWS regions to scan. Use 'all' for all regions, or comma-separated list (e.g., 'us-east-1,us-west-2'). Reduces execution time when limiting to specific regions." + ) + + # Logging + log_retention_param = CfnParameter( + self, "LogRetentionDays", + type="Number", + default=30, + description="[LOGGING] CloudWatch log retention period in days. Valid options: 1, 3, 7, 14, 30, 60, 90, 120, 180. Affects storage costs - longer retention = higher costs." + ) + + log_level_param = CfnParameter( + self, "LogLevel", + type="String", + default="INFO", + allowed_values=["DEBUG", "INFO", "WARNING", "ERROR"], + description="[LOGGING] Log verbosity. DEBUG = detailed (all protection decisions), INFO = standard (actions + summaries), WARNING = issues only, ERROR = failures only." 
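+            # Propagated to the Lambda as the LOG_LEVEL environment variable;
+            # utils/logging_config.py reads it when the module is imported.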
+ ) + + # SNS Topic for notifications + # Note: Subscription must be added manually via AWS Console or CLI + # CDK cannot conditionally create subscriptions based on parameter values + sns_topic = sns.Topic( + self, "CleanupNotificationTopic", + topic_name="AWSResourceCleanupNotifications", + display_name="AWS Resource Cleanup Notifications" + ) + + Tags.of(sns_topic).add("iit-billing-tag", "removeUntaggedEc2") + + # IAM Role for Lambda + lambda_role = iam.Role( + self, "ResourceCleanupRole", + role_name="RoleAWSResourceCleanup", + assumed_by=iam.ServicePrincipal("lambda.amazonaws.com"), + managed_policies=[ + iam.ManagedPolicy.from_aws_managed_policy_name( + "service-role/AWSLambdaBasicExecutionRole" + ) + ] + ) + + Tags.of(lambda_role).add("iit-billing-tag", "removeUntaggedEc2") + + # IAM Policy for Lambda + lambda_role.add_to_policy(iam.PolicyStatement( + effect=iam.Effect.ALLOW, + actions=[ + "ec2:DescribeRegions", + "ec2:DescribeInstances", + "ec2:TerminateInstances", + "ec2:StopInstances", + "ec2:CreateTags", + "ec2:DescribeVolumes", + "ec2:DeleteVolume", + "eks:DescribeCluster", + "eks:ListClusters", + "cloudformation:DescribeStacks", + "cloudformation:DescribeStackEvents", + "cloudformation:DeleteStack", + "ec2:DescribeSecurityGroups", + "ec2:RevokeSecurityGroupIngress", + "ec2:DeleteSecurityGroup", + "ec2:DisassociateRouteTable", + "ec2:DeleteRoute", + "ec2:DeleteRouteTable", + "ec2:DescribeVpcs", + "ec2:DeleteVpc", + "ec2:DescribeSubnets", + "ec2:DeleteSubnet", + "ec2:DescribeInternetGateways", + "ec2:DetachInternetGateway", + "ec2:DeleteInternetGateway", + "ec2:DescribeNatGateways", + "ec2:DeleteNatGateway", + "ec2:DescribeAddresses", + "ec2:ReleaseAddress", + "ec2:DescribeNetworkInterfaces", + "ec2:DeleteNetworkInterface", + "ec2:DescribeVpcEndpoints", + "ec2:DeleteVpcEndpoints", + "ec2:DescribeRouteTables", + "elasticloadbalancing:DescribeLoadBalancers", + "elasticloadbalancing:DeleteLoadBalancer", + "elasticloadbalancing:DescribeTargetGroups", + "elasticloadbalancing:DeleteTargetGroup", + "route53:ListHostedZones", + "route53:ListResourceRecordSets", + "route53:ChangeResourceRecordSets", + "route53:GetChange", + "s3:ListBucket", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:GetBucketLocation", + "sts:GetCallerIdentity" + ], + resources=["*"] + )) + + # SNS publish permission (always add, will be empty topic ARN if no email) + lambda_role.add_to_policy(iam.PolicyStatement( + effect=iam.Effect.ALLOW, + actions=["sns:Publish"], + resources=[sns_topic.topic_arn] + )) + + # Map log retention parameter to CDK enum + log_retention_mapping = { + 1: logs.RetentionDays.ONE_DAY, + 3: logs.RetentionDays.THREE_DAYS, + 7: logs.RetentionDays.ONE_WEEK, + 14: logs.RetentionDays.TWO_WEEKS, + 30: logs.RetentionDays.ONE_MONTH, + 60: logs.RetentionDays.TWO_MONTHS, + 90: logs.RetentionDays.THREE_MONTHS, + 120: logs.RetentionDays.FOUR_MONTHS, + 180: logs.RetentionDays.SIX_MONTHS, + } + + # Lambda Function + cleanup_lambda = lambda_.Function( + self, "ResourceCleanupLambda", + function_name="LambdaAWSResourceCleanup", + description="Comprehensive AWS resource cleanup: EC2, EBS volumes, EKS, OpenShift (VPC, ELB, Route53, S3)", + runtime=lambda_.Runtime.PYTHON_3_13, + architecture=lambda_.Architecture.ARM_64, + handler="aws_resource_cleanup.handler.lambda_handler", + code=lambda_.Code.from_asset("lambda"), + role=lambda_role, + timeout=Duration.seconds(600), + memory_size=1024, + reserved_concurrent_executions=1, + log_retention=log_retention_mapping.get( + 
log_retention_param.value_as_number, + logs.RetentionDays.ONE_MONTH + ), + environment={ + "DRY_RUN": dry_run_param.value_as_string, + "SNS_TOPIC_ARN": sns_topic.topic_arn, + "UNTAGGED_THRESHOLD_MINUTES": untagged_threshold_param.value_as_string, + "STOPPED_THRESHOLD_DAYS": stopped_threshold_param.value_as_string, + "EKS_CLEANUP_ENABLED": eks_cleanup_param.value_as_string, + "EKS_SKIP_PATTERN": eks_skip_pattern_param.value_as_string, + "OPENSHIFT_CLEANUP_ENABLED": openshift_cleanup_param.value_as_string, + "OPENSHIFT_BASE_DOMAIN": openshift_domain_param.value_as_string, + "VOLUME_CLEANUP_ENABLED": volume_cleanup_param.value_as_string, + "TARGET_REGIONS": regions_param.value_as_string, + "LOG_LEVEL": log_level_param.value_as_string + } + ) + + Tags.of(cleanup_lambda).add("iit-billing-tag", "removeUntaggedEc2") + + # EventBridge Rule (configurable schedule) + schedule_rule = events.Rule( + self, "CleanupScheduleRule", + rule_name="AWSResourceCleanupSchedule", + description=f"Executes every {schedule_rate_param.value_as_number} minutes for comprehensive AWS resource cleanup", + schedule=events.Schedule.rate(Duration.minutes(schedule_rate_param.value_as_number)), + enabled=True + ) + + # Add target with retry policy for failed invocations + schedule_rule.add_target(targets.LambdaFunction( + cleanup_lambda, + retry_attempts=2, # Retry failed invocations up to 2 times + max_event_age=Duration.hours(1) # Discard events older than 1 hour + )) + + # CloudWatch Alarms for monitoring and blast radius protection + lambda_errors_alarm = cloudwatch.Alarm( + self, "LambdaErrorsAlarm", + alarm_name="AWSResourceCleanup-LambdaErrors", + alarm_description="Alert when cleanup Lambda encounters errors", + metric=cleanup_lambda.metric_errors( + period=Duration.minutes(15), + statistic="Sum" + ), + threshold=1, + evaluation_periods=1, + treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING + ) + lambda_errors_alarm.add_alarm_action(cw_actions.SnsAction(sns_topic)) + + # Alarm for Lambda timeout (indicates potential performance issues) + lambda_timeout_alarm = cloudwatch.Alarm( + self, "LambdaTimeoutAlarm", + alarm_name="AWSResourceCleanup-LambdaTimeout", + alarm_description="Alert when cleanup Lambda approaches timeout (>8 minutes)", + metric=cleanup_lambda.metric_duration( + period=Duration.minutes(15), + statistic="Maximum" + ), + threshold=480000, # 8 minutes in milliseconds (Lambda timeout is 10min) + evaluation_periods=1, + comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD, + treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING + ) + lambda_timeout_alarm.add_alarm_action(cw_actions.SnsAction(sns_topic)) + + # Outputs + CfnOutput( + self, "LambdaFunctionName", + description="Name of the Lambda function", + value=cleanup_lambda.function_name, + export_name="AWSResourceCleanupLambdaName" + ) + + CfnOutput( + self, "LambdaFunctionArn", + description="ARN of the Lambda function", + value=cleanup_lambda.function_arn, + export_name="AWSResourceCleanupLambdaArn" + ) + + CfnOutput( + self, "SNSTopicArn", + description="ARN of the SNS topic for notifications", + value=sns_topic.topic_arn + ) + + CfnOutput( + self, "DryRunModeOutput", + description="Current dry-run mode setting", + value=dry_run_param.value_as_string + ) diff --git a/IaC/cdk/aws-resources-cleanup/tests/README.md b/IaC/cdk/aws-resources-cleanup/tests/README.md new file mode 100644 index 0000000000..1cff25623e --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/README.md @@ -0,0 +1,342 @@ +# AWS Resource Cleanup 
- Test Suite + +## 📊 Test Structure + +This test suite follows the **Testing Pyramid** pattern for optimal test organization and execution speed: + +``` +tests/ +├── conftest.py # Root fixtures & fixture factories +├── pytest.ini # Pytest configuration & markers +├── unit/ # ⚡ Fast, isolated unit tests (69 tests) +│ ├── conftest.py # Unit-specific fixtures +│ ├── test_protection_logic.py # Protection detection rules +│ ├── test_billing_validation.py # Billing tag validation +│ ├── test_cluster_detection.py # Cluster name extraction +│ ├── test_tag_conversion.py # Tag format conversion +│ ├── test_policy_priority.py # Policy evaluation order +│ └── policies/ # Cleanup policy tests +│ ├── test_ttl_policy.py +│ ├── test_stop_policy.py +│ ├── test_long_stopped_policy.py +│ └── test_untagged_policy.py +├── integration/ # 🔗 Component interaction tests +│ ├── conftest.py # Integration fixtures & mocks +│ └── (to be migrated) +└── e2e/ # 🌐 Full workflow tests + ├── conftest.py # E2E fixtures + └── (to be migrated) +``` + +## 🚀 Running Tests + +### By Directory (Recommended) +```bash +# Fast unit tests only (< 1 second) +just test +# or +cd aws-resources-cleanup +PYTHONPATH=lambda:$PYTHONPATH uv run --with pytest pytest tests/unit/ -v + +# Integration tests +pytest tests/integration/ -v + +# End-to-end tests +pytest tests/e2e/ -v + +# All tests +pytest tests/ +``` + +### By Marker +```bash +# Run only unit tests +pytest -m unit + +# Run policy-specific tests +pytest -m policies + +# Run AWS-related tests +pytest -m "unit and aws" + +# Run OpenShift tests +pytest -m openshift + +# Skip slow tests +pytest -m "not slow" + +# Run smoke tests only +pytest -m smoke +``` + +### With Coverage +```bash +just test-coverage +# or +pytest --cov=aws_resource_cleanup --cov-report=html +open htmlcov/index.html +``` + +## 📝 Writing Tests + +### Using Fixture Factories + +#### `make_instance` - Flexible Test Data Creation + +The `make_instance` fixture factory replaces 8+ individual fixtures with a single, flexible function: + +```python +def test_something(make_instance): + # Simple instance + instance = make_instance(name="test", billing_tag="pmm-staging") + + # Instance with expired TTL + instance = make_instance( + ttl_expired=True, + hours_old=3, + ttl_hours=1 + ) + + # Protected instance + instance = make_instance(protected=True) + + # OpenShift cluster instance + instance = make_instance( + openshift=True, + infra_id="my-infra-123", + cluster_name="my-cluster" + ) + + # EKS cluster instance + instance = make_instance( + eks=True, + eks_cluster="my-eks-cluster" + ) + + # Custom tags + instance = make_instance( + billing_tag="pmm-staging", + owner="test-user", + **{"custom-tag": "custom-value"} + ) +``` + +**Parameters:** +- `name`: Instance name (default: "test-instance") +- `state`: Instance state (default: "running") +- `billing_tag`: Billing tag value +- `ttl_expired`: Whether TTL should be expired (default: False) +- `ttl_hours`: TTL duration in hours (default: 1) +- `hours_old`: How many hours ago instance was launched +- `days_old`: How many days ago instance was launched +- `protected`: Use protected billing tag (default: False) +- `openshift`: Add OpenShift tags (default: False) +- `eks`: Add EKS tags (default: False) +- `owner`: Owner tag +- `cluster_name`: Cluster name tag +- `stop_after_days`: Add stop-after-days tag +- `**kwargs`: Additional custom tags + +#### `time_utils` - Consistent Time Handling + +```python +def test_time_based(time_utils): + # Get times relative to current_time + 
three_hours_ago = time_utils.hours_ago(3)
+    thirty_days_ago = time_utils.days_ago(30)
+    twenty_minutes_ago = time_utils.seconds_ago(1200)
+
+    # Get timestamps
+    ts = time_utils.timestamp()
+    old_ts = time_utils.timestamp(time_utils.days_ago(5))
+
+    # Get current time
+    now = time_utils.now()
+```
+
+### Test Organization Best Practices
+
+#### 1. Use Descriptive Test Names
+```python
+def test_instance_with_expired_ttl_creates_terminate_action():
+    """Clear, descriptive name following the pattern:
+    test_<subject>_<condition>_<expected_outcome>
+    """
+```
+
+#### 2. Follow GIVEN-WHEN-THEN Pattern
+```python
+def test_protection_logic(make_instance):
+    """
+    GIVEN an instance with a persistent billing tag
+    WHEN is_protected is called
+    THEN True should be returned (instance is protected)
+    """
+    # Arrange
+    instance = make_instance(protected=True)
+    tags_dict = convert_tags_to_dict(instance["Tags"])
+
+    # Act
+    result = is_protected(tags_dict)
+
+    # Assert
+    assert result is True
+```
+
+#### 3. Group Related Tests in Classes
+```python
+@pytest.mark.unit
+@pytest.mark.policies
+class TestTTLExpirationDetection:
+    """Test TTL expiration detection logic."""
+
+    def test_expired_ttl_creates_action(self):
+        # ...
+
+    def test_valid_ttl_returns_none(self):
+        # ...
+```
+
+#### 4. Use Markers Appropriately
+```python
+@pytest.mark.unit       # Automatically added by unit/conftest.py
+@pytest.mark.policies   # Indicates policy-related test
+@pytest.mark.aws        # Test involves AWS concepts
+@pytest.mark.openshift  # OpenShift-specific test
+@pytest.mark.slow       # Slow-running test (>1s)
+class TestMyFeature:
+    # ...
+```
+
+## 🏷️ Available Test Markers
+
+| Marker | Description | Usage |
+|--------|-------------|-------|
+| `unit` | Fast, isolated unit tests | Auto-applied to tests/unit/ |
+| `integration` | Component interaction tests | Auto-applied to tests/integration/ |
+| `e2e` | End-to-end workflow tests | Auto-applied to tests/e2e/ |
+| `aws` | Tests involving AWS services | Manual |
+| `policies` | Cleanup policy tests | Manual |
+| `openshift` | OpenShift-specific tests | Manual |
+| `eks` | EKS-specific tests | Manual |
+| `slow` | Slow tests (>1s) | Manual |
+| `smoke` | Critical path smoke tests | Manual |
+
+## 📚 Test Categories
+
+### Unit Tests (`tests/unit/`)
+**Purpose:** Test individual functions and business logic in isolation
+**Speed:** < 1 second for all tests
+**Mocking:** Minimal to none (pure business logic)
+
+**What to test:**
+- Protection detection rules
+- Billing tag validation
+- Cluster name extraction
+- Policy priority logic
+- Tag conversion utilities
+- Individual policy functions
+
+**Example:**
+```python
+def test_persistent_tag_is_protected(make_instance):
+    instance = make_instance(billing_tag="jenkins-cloud")
+    tags_dict = convert_tags_to_dict(instance["Tags"])
+    assert is_protected(tags_dict) is True
+```
+
+### Integration Tests (`tests/integration/`)
+**Purpose:** Test component interactions with mocking
+**Speed:** 1-5 seconds
+**Mocking:** AWS services (EC2, SNS, CloudFormation)
+
+**What to test:**
+- Action execution with AWS mocks
+- Region cleanup orchestration
+- Notification flow
+- Error handling in execution layer
+
+**Example:**
+```python
+@patch("aws_resource_cleanup.ec2.instances.boto3.client")
+def test_terminate_action_execution(mock_boto):
+    # Test with mocked AWS service
+    ...
+```
+
+### End-to-End Tests (`tests/e2e/`)
+**Purpose:** Test complete workflows
+**Speed:** 5-10 seconds
+**Mocking:** Comprehensive AWS environment
+
+**What to test:**
+- Lambda handler entry point
+- Multi-region orchestration
+- Complete execution flows
+- Error propagation
+
+## 🔧 Troubleshooting
+
+### Tests Not Found
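+
+The tests import `aws_resource_cleanup` from the `lambda/` directory, so pytest can only collect them when that directory is on `PYTHONPATH`: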
+```bash +# Ensure PYTHONPATH includes lambda directory +cd aws-resources-cleanup +PYTHONPATH=lambda:$PYTHONPATH pytest tests/ +``` + +### Import Errors +```bash +# Install test dependencies +uv pip install -r requirements.txt +cd lambda && uv pip install -r aws_resource_cleanup/requirements.txt +``` + +### Fixture Not Found +- Check if fixture is in the correct conftest.py +- Remember fixture scope (function, class, module, session) +- Verify fixture is imported or defined in parent conftest.py + +## 📈 Test Statistics + +**Current Status:** +- ✅ 69 unit tests passing (100%) +- ⏳ Integration tests: to be migrated +- ⏳ E2E tests: to be migrated + +**Unit Test Breakdown:** +- Protection Logic: 17 tests +- Billing Validation: 11 tests +- Cluster Detection: 5 tests +- Tag Conversion: 3 tests +- Policy Priority: 4 tests +- TTL Policy: 14 tests +- Stop Policy: 5 tests +- Long Stopped Policy: 4 tests +- Untagged Policy: 6 tests + +## 🎯 Migration Notes + +### Legacy Fixtures (Deprecated) +The following fixtures are kept for backward compatibility but should not be used in new tests: +- `instance_with_valid_billing_tag` → use `make_instance(billing_tag="pmm-staging")` +- `instance_with_expired_ttl` → use `make_instance(ttl_expired=True, hours_old=2)` +- `instance_without_billing_tag` → use `make_instance()` (no billing_tag) +- `instance_stopped_long_term` → use `make_instance(state="stopped", days_old=35)` +- `protected_instance` → use `make_instance(protected=True)` +- `openshift_cluster_instance` → use `make_instance(openshift=True)` +- `eks_cluster_instance` → use `make_instance(eks=True)` + +## 🚦 CI/CD Integration + +```bash +# Full CI pipeline +just ci + +# Or manually +just lint +just test +just synth +``` + +## 📖 Additional Resources + +- [Pytest Documentation](https://docs.pytest.org/) +- [Testing Best Practices](https://pytest-with-eric.com/pytest-best-practices/pytest-organize-tests/) +- [Test Pyramid Concept](https://martinfowler.com/articles/practical-test-pyramid.html) \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/__init__.py b/IaC/cdk/aws-resources-cleanup/tests/__init__.py new file mode 100644 index 0000000000..71bb083614 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/__init__.py @@ -0,0 +1 @@ +"""Unit tests for AWS resource cleanup Lambda.""" diff --git a/IaC/cdk/aws-resources-cleanup/tests/conftest.py b/IaC/cdk/aws-resources-cleanup/tests/conftest.py new file mode 100644 index 0000000000..ba21cd0a17 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/conftest.py @@ -0,0 +1,471 @@ +"""Pytest configuration and shared fixtures for AWS resource cleanup tests. + +This file contains: +1. InstanceBuilder - Builder pattern for creating test EC2 instances +2. Fixture factories - Reusable functions for creating test data (make_instance, time_utils) +3. Time utilities - Helpers for time-based test scenarios +4. Legacy fixtures - Deprecated fixtures kept for backward compatibility +""" + +from __future__ import annotations +import datetime +import pytest +from typing import Any, Callable + + +class VolumeBuilder: + """Builder pattern for creating test EBS volumes. + + This builder helps create test volume data structures with various + configurations without needing to mock AWS services. 
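+
+    Example (illustrative values):
+        volume = (
+            VolumeBuilder()
+            .with_name("orphaned-volume")
+            .with_billing_tag("pmm-staging")
+            .with_size(100)
+            .build()
+        )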
+ """ + + def __init__(self): + self._volume = { + "VolumeId": "vol-test123456", + "State": "available", + "CreateTime": datetime.datetime.now(datetime.timezone.utc), + "Size": 10, + "VolumeType": "gp3", + "Tags": [], + } + + def with_volume_id(self, volume_id: str) -> VolumeBuilder: + """Set volume ID.""" + self._volume["VolumeId"] = volume_id + return self + + def with_name(self, name: str) -> VolumeBuilder: + """Set Name tag.""" + self._add_tag("Name", name) + return self + + def with_state(self, state: str) -> VolumeBuilder: + """Set volume state (available, in-use, creating, deleting).""" + self._volume["State"] = state + return self + + def with_create_time(self, create_time: datetime.datetime) -> VolumeBuilder: + """Set create time.""" + self._volume["CreateTime"] = create_time + return self + + def with_size(self, size_gb: int) -> VolumeBuilder: + """Set volume size in GB.""" + self._volume["Size"] = size_gb + return self + + def with_billing_tag(self, billing_tag: str) -> VolumeBuilder: + """Add iit-billing-tag.""" + self._add_tag("iit-billing-tag", billing_tag) + return self + + def with_tag(self, key: str, value: str) -> VolumeBuilder: + """Add custom tag.""" + self._add_tag(key, value) + return self + + def _add_tag(self, key: str, value: str): + """Internal method to add a tag.""" + self._volume["Tags"].append({"Key": key, "Value": value}) + + def build(self) -> dict[str, Any]: + """Build and return the volume dictionary.""" + return self._volume + + +class InstanceBuilder: + """Builder pattern for creating test EC2 instances. + + This builder helps create test instance data structures with various + configurations without needing to mock AWS services. + """ + + def __init__(self): + self._instance = { + "InstanceId": "i-test123456", + "State": {"Name": "running"}, + "LaunchTime": datetime.datetime.now(datetime.timezone.utc), + "Tags": [], + } + + def with_instance_id(self, instance_id: str) -> InstanceBuilder: + """Set instance ID.""" + self._instance["InstanceId"] = instance_id + return self + + def with_name(self, name: str) -> InstanceBuilder: + """Set Name tag.""" + self._add_tag("Name", name) + return self + + def with_state(self, state: str) -> InstanceBuilder: + """Set instance state (running, stopped).""" + self._instance["State"]["Name"] = state + return self + + def with_launch_time(self, launch_time: datetime.datetime) -> InstanceBuilder: + """Set launch time.""" + self._instance["LaunchTime"] = launch_time + return self + + def with_ttl_tags( + self, creation_time: int, delete_after_hours: int + ) -> InstanceBuilder: + """Add TTL tags (creation-time and delete-cluster-after-hours).""" + self._add_tag("creation-time", str(creation_time)) + self._add_tag("delete-cluster-after-hours", str(delete_after_hours)) + return self + + def with_billing_tag(self, billing_tag: str) -> InstanceBuilder: + """Add iit-billing-tag.""" + self._add_tag("iit-billing-tag", billing_tag) + return self + + def with_owner(self, owner: str) -> InstanceBuilder: + """Add owner tag.""" + self._add_tag("owner", owner) + return self + + def with_cluster_name(self, cluster_name: str) -> InstanceBuilder: + """Add cluster-name tag.""" + self._add_tag("cluster-name", cluster_name) + return self + + def with_stop_after_days(self, days: int) -> InstanceBuilder: + """Add stop-after-days tag.""" + self._add_tag("stop-after-days", str(days)) + return self + + def with_openshift_tags(self, infra_id: str) -> InstanceBuilder: + """Add OpenShift-specific tags.""" + self._add_tag("iit-billing-tag", 
"openshift") + self._add_tag(f"kubernetes.io/cluster/{infra_id}", "owned") + return self + + def with_eks_tags(self, cluster_name: str) -> InstanceBuilder: + """Add EKS-specific tags.""" + self._add_tag("iit-billing-tag", "eks") + self._add_tag(f"kubernetes.io/cluster/{cluster_name}", "owned") + return self + + def with_tag(self, key: str, value: str) -> InstanceBuilder: + """Add custom tag.""" + self._add_tag(key, value) + return self + + def _add_tag(self, key: str, value: str): + """Internal method to add a tag.""" + self._instance["Tags"].append({"Key": key, "Value": value}) + + def build(self) -> dict[str, Any]: + """Build and return the instance dictionary.""" + return self._instance + + +# ===== Core Fixtures ===== + + +@pytest.fixture +def instance_builder(): + """Fixture that returns a new InstanceBuilder.""" + return InstanceBuilder() + + +@pytest.fixture +def volume_builder(): + """Fixture that returns a new VolumeBuilder.""" + return VolumeBuilder() + + +@pytest.fixture +def current_time(): + """Fixture for current time as Unix timestamp.""" + return 1000000 + + +# ===== Fixture Factories ===== + + +@pytest.fixture +def make_instance(instance_builder, current_time): + """Factory fixture for creating test instances with various configurations. + + This replaces multiple similar fixtures with a single flexible factory. + + Args: + name: Instance name (default: "test-instance") + state: Instance state (default: "running") + billing_tag: Billing tag value (default: None) + ttl_expired: Whether TTL should be expired (default: False) + ttl_hours: TTL duration in hours (default: 1) + hours_old: How many hours ago instance was launched (default: 0) + days_old: How many days ago instance was launched (default: 0) + protected: Use protected billing tag (default: False) + openshift: Add OpenShift tags (default: False) + eks: Add EKS tags (default: False) + owner: Owner tag (default: None) + cluster_name: Cluster name tag (default: None) + stop_after_days: Add stop-after-days tag (default: None) + **kwargs: Additional custom tags + + Returns: + dict: Instance data structure + + Example: + # Simple instance + instance = make_instance(name="test", billing_tag="pmm-staging") + + # Expired TTL instance + instance = make_instance(ttl_expired=True, ttl_hours=1, hours_old=3) + + # Protected OpenShift instance + instance = make_instance(protected=True, openshift=True) + """ + def _make( + name: str = "test-instance", + state: str = "running", + billing_tag: str | None = None, + ttl_expired: bool = False, + ttl_hours: int = 1, + hours_old: int = 0, + days_old: int = 0, + protected: bool = False, + openshift: bool = False, + eks: bool = False, + owner: str | None = None, + cluster_name: str | None = None, + stop_after_days: int | None = None, + **kwargs + ) -> dict[str, Any]: + # Calculate launch time + total_seconds = (days_old * 86400) + (hours_old * 3600) + launch_time = datetime.datetime.fromtimestamp( + current_time - total_seconds, + tz=datetime.timezone.utc + ) + + # Build instance + builder = ( + instance_builder + .with_name(name) + .with_state(state) + .with_launch_time(launch_time) + ) + + # Apply protection + if protected: + builder = builder.with_billing_tag("jenkins-dev-pmm") + elif billing_tag: + builder = builder.with_billing_tag(billing_tag) + + # Apply TTL tags + if ttl_expired: + creation_time = current_time - (ttl_hours * 3600 + 3600) # Expired by 1 hour + builder = builder.with_ttl_tags(creation_time, ttl_hours) + + # Apply cluster tags + if openshift: + infra_id = 
kwargs.pop('infra_id', 'test-infra-123') + builder = builder.with_openshift_tags(infra_id) + if not cluster_name: + cluster_name = 'test-openshift' + + if eks: + eks_cluster = kwargs.pop('eks_cluster', 'test-eks-cluster') + builder = builder.with_eks_tags(eks_cluster) + if not cluster_name: + cluster_name = eks_cluster + + # Apply optional tags + if owner: + builder = builder.with_owner(owner) + if cluster_name: + builder = builder.with_cluster_name(cluster_name) + if stop_after_days is not None: + builder = builder.with_stop_after_days(stop_after_days) + + # Apply custom tags + for key, value in kwargs.items(): + builder = builder.with_tag(key, str(value)) + + return builder.build() + + return _make + + +@pytest.fixture +def time_utils(current_time): + """Utility functions for time-based test scenarios. + + Provides consistent time handling across all tests. + + Example: + # Get times relative to current_time + three_hours_ago = time_utils.hours_ago(3) + thirty_days_ago = time_utils.days_ago(30) + + # Get timestamps + ts = time_utils.timestamp() + old_ts = time_utils.timestamp(time_utils.days_ago(5)) + """ + class TimeUtils: + @staticmethod + def now() -> datetime.datetime: + """Get current time as datetime.""" + return datetime.datetime.fromtimestamp( + current_time, + tz=datetime.timezone.utc + ) + + @staticmethod + def timestamp(dt: datetime.datetime | None = None) -> int: + """Convert datetime to Unix timestamp.""" + if dt is None: + return current_time + return int(dt.timestamp()) + + @staticmethod + def hours_ago(hours: int) -> datetime.datetime: + """Get datetime N hours in the past.""" + return datetime.datetime.fromtimestamp( + current_time - (hours * 3600), + tz=datetime.timezone.utc + ) + + @staticmethod + def days_ago(days: int) -> datetime.datetime: + """Get datetime N days in the past.""" + return datetime.datetime.fromtimestamp( + current_time - (days * 86400), + tz=datetime.timezone.utc + ) + + @staticmethod + def seconds_ago(seconds: int) -> datetime.datetime: + """Get datetime N seconds in the past.""" + return datetime.datetime.fromtimestamp( + current_time - seconds, + tz=datetime.timezone.utc + ) + + return TimeUtils() + + +# ===== Legacy Fixtures (Deprecated - Use make_instance instead) ===== +# These fixtures are kept for backward compatibility during migration. +# New tests should use make_instance fixture factory. 
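+#
+# Illustrative migration (legacy fixture vs. its documented factory equivalent):
+#
+#   def test_expired(instance_with_expired_ttl):    # legacy
+#       ...
+#
+#   def test_expired(make_instance):                # preferred
+#       instance = make_instance(ttl_expired=True, hours_old=2)
+#       ...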
+ + +@pytest.fixture +def instance_with_valid_billing_tag(instance_builder): + """Instance with valid billing tag.""" + return ( + instance_builder.with_name("test-instance") + .with_billing_tag("pmm-staging") + .with_owner("test-user") + .build() + ) + + +@pytest.fixture +def instance_with_expired_ttl(instance_builder, current_time): + """Instance with expired TTL (created 2 hours ago, TTL 1 hour).""" + creation_time = current_time - 7200 # 2 hours ago + return ( + instance_builder.with_name("expired-instance") + .with_ttl_tags(creation_time, 1) # 1 hour TTL + .with_billing_tag("test-billing") + .with_owner("test-user") + .build() + ) + + +@pytest.fixture +def instance_without_billing_tag(instance_builder): + """Instance without any billing tag.""" + now = datetime.datetime.now(datetime.timezone.utc) + old_time = now - datetime.timedelta(hours=2) + return ( + instance_builder.with_name("untagged-instance") + .with_launch_time(old_time) + .build() + ) + + +@pytest.fixture +def instance_stopped_long_term(instance_builder): + """Instance stopped for more than 30 days.""" + now = datetime.datetime.now(datetime.timezone.utc) + old_time = now - datetime.timedelta(days=35) + return ( + instance_builder.with_name("long-stopped") + .with_state("stopped") + .with_launch_time(old_time) + .with_billing_tag("test-billing") + .build() + ) + + +@pytest.fixture +def instance_with_stop_policy(instance_builder): + """Instance with stop-after-days policy.""" + now = datetime.datetime.now(datetime.timezone.utc) + old_time = now - datetime.timedelta(days=8) + return ( + instance_builder.with_name("pmm-staging") + .with_state("running") + .with_launch_time(old_time) + .with_stop_after_days(7) + .with_billing_tag("pmm-staging") + .build() + ) + + +@pytest.fixture +def protected_instance(instance_builder): + """Instance with persistent billing tag (protected).""" + return ( + instance_builder.with_name("protected-instance") + .with_billing_tag("jenkins-dev-pmm") + .build() + ) + + +@pytest.fixture +def openshift_cluster_instance(instance_builder, current_time): + """Instance that's part of an OpenShift cluster with expired TTL.""" + creation_time = current_time - 7200 # 2 hours ago + return ( + instance_builder.with_name("openshift-master") + .with_ttl_tags(creation_time, 1) + .with_openshift_tags("test-infra-123") + .with_cluster_name("test-openshift") + .with_owner("test-user") + .build() + ) + + +@pytest.fixture +def eks_cluster_instance(instance_builder, current_time): + """Instance that's part of an EKS cluster with expired TTL.""" + creation_time = current_time - 7200 # 2 hours ago + return ( + instance_builder.with_name("eks-node") + .with_ttl_tags(creation_time, 1) + .with_eks_tags("test-eks-cluster") + .with_cluster_name("test-eks-cluster") + .with_owner("test-user") + .build() + ) + + +@pytest.fixture +def tags_dict_from_instance(): + """Helper function to convert instance tags to dictionary format.""" + + def _convert(instance: dict[str, Any]) -> dict[str, str]: + """Convert Tags list to dict.""" + return {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])} + + return _convert diff --git a/IaC/cdk/aws-resources-cleanup/tests/e2e/__init__.py b/IaC/cdk/aws-resources-cleanup/tests/e2e/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/IaC/cdk/aws-resources-cleanup/tests/e2e/conftest.py b/IaC/cdk/aws-resources-cleanup/tests/e2e/conftest.py new file mode 100644 index 0000000000..8784daada7 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/e2e/conftest.py @@ -0,0 +1,9 
@@ +"""Fixtures specific to end-to-end tests.""" + +import pytest + + +@pytest.fixture(autouse=True) +def _mark_as_e2e(request): + """Automatically mark all tests in e2e/ as e2e tests.""" + request.node.add_marker(pytest.mark.e2e) \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/e2e/test_lambda_handler.py b/IaC/cdk/aws-resources-cleanup/tests/e2e/test_lambda_handler.py new file mode 100644 index 0000000000..29f79043bb --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/e2e/test_lambda_handler.py @@ -0,0 +1,468 @@ +"""End-to-end tests for Lambda handler entry point and integration flows. + +Tests focus on: +- lambda_handler() entry point (multi-region orchestration) +- End-to-end integration flows +- Error propagation and partial failure scenarios +""" + +from __future__ import annotations +import datetime +import json +import pytest +from unittest.mock import Mock, patch, MagicMock +from botocore.exceptions import ClientError + +from aws_resource_cleanup.handler import lambda_handler, cleanup_region +from aws_resource_cleanup.models import CleanupAction + + +@pytest.fixture +def mock_lambda_context(): + """Create a mock Lambda context object.""" + context = Mock() + context.function_name = "test-function" + context.function_version = "$LATEST" + context.invoked_function_arn = "arn:aws:lambda:us-east-1:123456789012:function:test-function" + context.memory_limit_in_mb = 128 + context.aws_request_id = "test-request-id" + context.log_group_name = "/aws/lambda/test-function" + context.log_stream_name = "2024/01/01/[$LATEST]test" + context.get_remaining_time_in_millis = Mock(return_value=300000) + return context + + +@pytest.mark.e2e +@pytest.mark.aws +class TestLambdaHandlerEntryPoint: + """Test the main Lambda handler entry point.""" + + @patch("aws_resource_cleanup.handler.cleanup_region") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_lambda_handler_processes_multiple_regions( + self, mock_boto_client, mock_cleanup_region, mock_lambda_context + ): + """ + GIVEN multiple AWS regions + WHEN lambda_handler is invoked + THEN all regions should be processed + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_regions.return_value = { + "Regions": [ + {"RegionName": "us-east-1"}, + {"RegionName": "us-west-2"}, + {"RegionName": "eu-west-1"}, + ] + } + + # Each region returns different actions + action1 = CleanupAction( + instance_id="i-region1", + region="us-east-1", + name="test1", + action="TERMINATE", + reason="TTL expired", + days_overdue=1.0, + ) + action2 = CleanupAction( + instance_id="i-region2", + region="us-west-2", + name="test2", + action="STOP", + reason="stop-after-days", + days_overdue=0.5, + ) + mock_cleanup_region.side_effect = [[action1], [action2], []] + + result = lambda_handler({}, mock_lambda_context) + + assert result["statusCode"] == 200 + body = json.loads(result["body"]) + assert body["total_actions"] == 2 + assert body["by_action"]["TERMINATE"] == 1 + assert body["by_action"]["STOP"] == 1 + assert len(body["actions"]) == 2 + + # Verify cleanup_region was called for each region + assert mock_cleanup_region.call_count == 3 + mock_cleanup_region.assert_any_call("us-east-1") + mock_cleanup_region.assert_any_call("us-west-2") + mock_cleanup_region.assert_any_call("eu-west-1") + + @patch("aws_resource_cleanup.handler.cleanup_region") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_lambda_handler_no_actions_across_all_regions( + self, mock_boto_client, mock_cleanup_region, 
mock_lambda_context + ): + """ + GIVEN regions with no cleanup actions needed + WHEN lambda_handler is invoked + THEN response should show zero actions + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_regions.return_value = { + "Regions": [ + {"RegionName": "us-east-1"}, + {"RegionName": "us-west-2"}, + ] + } + + mock_cleanup_region.return_value = [] + + result = lambda_handler({}, mock_lambda_context) + + assert result["statusCode"] == 200 + body = json.loads(result["body"]) + assert body["total_actions"] == 0 + assert body["by_action"] == {} + assert body["actions"] == [] + + @patch("aws_resource_cleanup.handler.cleanup_region") + @patch("aws_resource_cleanup.handler.boto3.client") + @patch("aws_resource_cleanup.handler.DRY_RUN", True) + def test_lambda_handler_includes_dry_run_flag( + self, mock_boto_client, mock_cleanup_region, mock_lambda_context + ): + """ + GIVEN DRY_RUN mode enabled + WHEN lambda_handler is invoked + THEN response should indicate dry_run=true + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_regions.return_value = {"Regions": [{"RegionName": "us-east-1"}]} + mock_cleanup_region.return_value = [] + + result = lambda_handler({}, mock_lambda_context) + + body = json.loads(result["body"]) + assert body["dry_run"] is True + + @patch("aws_resource_cleanup.handler.cleanup_region") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_lambda_handler_aggregates_actions_correctly( + self, mock_boto_client, mock_cleanup_region, mock_lambda_context + ): + """ + GIVEN multiple regions with various action types + WHEN lambda_handler is invoked + THEN actions should be aggregated correctly by type + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_regions.return_value = { + "Regions": [{"RegionName": "us-east-1"}, {"RegionName": "us-west-2"}] + } + + # Region 1: 2 TERMINATE, 1 STOP + # Region 2: 1 TERMINATE, 1 TERMINATE_CLUSTER + region1_actions = [ + CleanupAction("i-1", "us-east-1", "n1", "TERMINATE", "r1", 1.0), + CleanupAction("i-2", "us-east-1", "n2", "TERMINATE", "r2", 1.0), + CleanupAction("i-3", "us-east-1", "n3", "STOP", "r3", 0.5), + ] + region2_actions = [ + CleanupAction("i-4", "us-west-2", "n4", "TERMINATE", "r4", 2.0), + CleanupAction("i-5", "us-west-2", "n5", "TERMINATE_CLUSTER", "r5", 3.0, cluster_name="eks"), + ] + mock_cleanup_region.side_effect = [region1_actions, region2_actions] + + result = lambda_handler({}, mock_lambda_context) + + body = json.loads(result["body"]) + assert body["total_actions"] == 5 + assert body["by_action"]["TERMINATE"] == 3 + assert body["by_action"]["STOP"] == 1 + assert body["by_action"]["TERMINATE_CLUSTER"] == 1 + + @patch("aws_resource_cleanup.handler.boto3.client") + def test_lambda_handler_handles_describe_regions_failure(self, mock_boto_client, mock_lambda_context): + """ + GIVEN describe_regions API call fails + WHEN lambda_handler is invoked + THEN exception should be raised + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_regions.side_effect = ClientError( + {"Error": {"Code": "RequestLimitExceeded", "Message": "Rate limit"}}, + "DescribeRegions", + ) + + with pytest.raises(ClientError): + lambda_handler({}, mock_lambda_context) + + +@pytest.mark.e2e +@pytest.mark.aws +class TestPartialFailureScenarios: + """Test error propagation and partial failure handling.""" + + @patch("aws_resource_cleanup.handler.cleanup_region") + 
@patch("aws_resource_cleanup.handler.boto3.client") + def test_lambda_handler_continues_after_region_failure( + self, mock_boto_client, mock_cleanup_region, mock_lambda_context + ): + """ + GIVEN one region fails but others succeed + WHEN lambda_handler is invoked + THEN successful regions should be processed and returned + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_regions.return_value = { + "Regions": [ + {"RegionName": "us-east-1"}, + {"RegionName": "us-west-2"}, + {"RegionName": "eu-west-1"}, + ] + } + + action = CleanupAction("i-ok", "us-east-1", "test", "TERMINATE", "test", 1.0) + + # Region 1: success, Region 2: exception, Region 3: success + mock_cleanup_region.side_effect = [ + [action], + [], # Returns empty instead of raising to match actual behavior + [action], + ] + + result = lambda_handler({}, mock_lambda_context) + + # Should succeed with actions from regions 1 and 3 + assert result["statusCode"] == 200 + body = json.loads(result["body"]) + assert body["total_actions"] == 2 + + +@pytest.mark.e2e +@pytest.mark.aws +@pytest.mark.smoke +class TestEndToEndIntegrationFlow: + """Integration tests with minimal mocking to verify complete flow.""" + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_end_to_end_cleanup_flow_with_expired_ttl( + self, mock_boto_client, mock_send_notification, mock_execute_action, mock_lambda_context + ): + """ + GIVEN Lambda invocation with instances having expired TTL + WHEN full cleanup flow executes + THEN instances should be identified, actions created, executed, and reported + """ + # Setup EC2 mock for describe_regions + mock_ec2_main = Mock() + mock_ec2_regional = Mock() + + def get_client(service, region_name=None): + if region_name: + return mock_ec2_regional + return mock_ec2_main + + mock_boto_client.side_effect = get_client + + # Mock describe_regions - single region for simplicity + mock_ec2_main.describe_regions.return_value = { + "Regions": [{"RegionName": "us-east-1"}] + } + + # Mock describe_instances - instance with expired TTL + now = datetime.datetime.now(datetime.timezone.utc) + old_time = now - datetime.timedelta(hours=3) + current_timestamp = int(now.timestamp()) + creation_timestamp = current_timestamp - 10800 # 3 hours ago + + mock_ec2_regional.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-expired-ttl", + "State": {"Name": "running"}, + "LaunchTime": old_time, + "Tags": [ + {"Key": "Name", "Value": "test-instance"}, + {"Key": "creation-time", "Value": str(creation_timestamp)}, + {"Key": "delete-cluster-after-hours", "Value": "1"}, + {"Key": "iit-billing-tag", "Value": "test-team"}, + {"Key": "owner", "Value": "test-user"}, + ], + } + ] + } + ] + } + + mock_execute_action.return_value = True + + # Execute Lambda + result = lambda_handler({}, mock_lambda_context) + + # Verify response + assert result["statusCode"] == 200 + body = json.loads(result["body"]) + assert body["total_actions"] == 1 + assert body["actions"][0]["instance_id"] == "i-expired-ttl" + assert body["actions"][0]["action"] == "TERMINATE" + + # Verify action was executed + mock_execute_action.assert_called_once() + executed_action = mock_execute_action.call_args[0][0] + assert executed_action.instance_id == "i-expired-ttl" + assert executed_action.action == "TERMINATE" + + # Verify notification was sent + 
mock_send_notification.assert_called_once() + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_end_to_end_flow_with_protected_and_actionable_instances( + self, mock_boto_client, mock_send_notification, mock_execute_action, mock_lambda_context + ): + """ + GIVEN mix of protected and actionable instances + WHEN full cleanup flow executes + THEN only actionable instances should be processed + """ + mock_ec2_main = Mock() + mock_ec2_regional = Mock() + + def get_client(service, region_name=None): + if region_name: + return mock_ec2_regional + return mock_ec2_main + + mock_boto_client.side_effect = get_client + + mock_ec2_main.describe_regions.return_value = { + "Regions": [{"RegionName": "us-east-1"}] + } + + now = datetime.datetime.now(datetime.timezone.utc) + old_time = now - datetime.timedelta(days=35) + + # Instance 1: Protected (persistent tag) + # Instance 2: Long stopped without billing tag (actionable) + mock_ec2_regional.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-protected", + "State": {"Name": "running"}, + "LaunchTime": now, + "Tags": [ + {"Key": "Name", "Value": "protected-jenkins"}, + {"Key": "iit-billing-tag", "Value": "jenkins-cloud"}, + ], + }, + { + "InstanceId": "i-long-stopped", + "State": {"Name": "stopped"}, + "LaunchTime": old_time, + "Tags": [ + {"Key": "Name", "Value": "old-stopped"}, + ], + }, + ] + } + ] + } + + mock_execute_action.return_value = True + + result = lambda_handler({}, mock_lambda_context) + + assert result["statusCode"] == 200 + body = json.loads(result["body"]) + assert body["total_actions"] == 1 + assert body["actions"][0]["instance_id"] == "i-long-stopped" + + # Only one action should be executed (protected instance skipped) + assert mock_execute_action.call_count == 1 + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_end_to_end_flow_no_instances_in_region( + self, mock_boto_client, mock_send_notification, mock_execute_action, mock_lambda_context + ): + """ + GIVEN region with no instances + WHEN full cleanup flow executes + THEN no actions should be taken + """ + mock_ec2_main = Mock() + mock_ec2_regional = Mock() + + def get_client(service, region_name=None): + if region_name: + return mock_ec2_regional + return mock_ec2_main + + mock_boto_client.side_effect = get_client + + mock_ec2_main.describe_regions.return_value = { + "Regions": [{"RegionName": "us-east-1"}] + } + + mock_ec2_regional.describe_instances.return_value = {"Reservations": []} + + result = lambda_handler({}, mock_lambda_context) + + assert result["statusCode"] == 200 + body = json.loads(result["body"]) + assert body["total_actions"] == 0 + assert body["actions"] == [] + + mock_execute_action.assert_not_called() + mock_send_notification.assert_not_called() + + +@pytest.mark.e2e +class TestLambdaEventHandling: + """Test Lambda event validation and edge cases.""" + + @patch("aws_resource_cleanup.handler.cleanup_region") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_lambda_handler_accepts_empty_event( + self, mock_boto_client, mock_cleanup_region, mock_lambda_context + ): + """ + GIVEN empty Lambda event + WHEN lambda_handler is invoked + THEN it should process normally + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 
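+        # Minimal wiring: a single region and no cleanup actions, isolating
+        # event handling from region processing.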
+        mock_ec2.describe_regions.return_value = {"Regions": [{"RegionName": "us-east-1"}]}
+        mock_cleanup_region.return_value = []
+
+        result = lambda_handler({}, mock_lambda_context)
+
+        assert result["statusCode"] == 200
+
+    @patch("aws_resource_cleanup.handler.cleanup_region")
+    @patch("aws_resource_cleanup.handler.boto3.client")
+    def test_lambda_handler_accepts_none_context(
+        self, mock_boto_client, mock_cleanup_region
+    ):
+        """
+        GIVEN None as Lambda context
+        WHEN lambda_handler is invoked
+        THEN it should process normally
+        """
+        mock_ec2 = Mock()
+        mock_boto_client.return_value = mock_ec2
+        mock_ec2.describe_regions.return_value = {"Regions": [{"RegionName": "us-east-1"}]}
+        mock_cleanup_region.return_value = []
+
+        result = lambda_handler({}, None)
+
+        assert result["statusCode"] == 200
diff --git a/IaC/cdk/aws-resources-cleanup/tests/integration/__init__.py b/IaC/cdk/aws-resources-cleanup/tests/integration/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/IaC/cdk/aws-resources-cleanup/tests/integration/conftest.py b/IaC/cdk/aws-resources-cleanup/tests/integration/conftest.py
new file mode 100644
index 0000000000..4581dd6088
--- /dev/null
+++ b/IaC/cdk/aws-resources-cleanup/tests/integration/conftest.py
@@ -0,0 +1,83 @@
+"""Fixtures specific to integration tests."""
+
+import pytest
+from unittest.mock import Mock
+
+
+@pytest.fixture(autouse=True)
+def _mark_as_integration(request):
+    """Automatically mark all tests in integration/ as integration tests."""
+    request.node.add_marker(pytest.mark.integration)
+
+
+@pytest.fixture
+def mock_ec2_client():
+    """Factory for creating mock EC2 clients with common behaviors.
+
+    Example:
+        ec2 = mock_ec2_client(
+            describe_instances_response={
+                "Reservations": [{"Instances": [instance_data]}]
+            }
+        )
+    """
+    def _create_mock(**kwargs):
+        mock = Mock()
+        mock.describe_instances.return_value = kwargs.get(
+            'describe_instances_response',
+            {"Reservations": []}
+        )
+        mock.terminate_instances.return_value = kwargs.get(
+            'terminate_response',
+            {}
+        )
+        mock.stop_instances.return_value = kwargs.get(
+            'stop_response',
+            {}
+        )
+        mock.describe_regions.return_value = kwargs.get(
+            'describe_regions_response',
+            {"Regions": [{"RegionName": "us-east-1"}]}
+        )
+        return mock
+    return _create_mock
+
+
+@pytest.fixture
+def mock_sns_client():
+    """Factory for creating mock SNS clients.
+
+    Example:
+        sns = mock_sns_client(
+            publish_response={'MessageId': 'test-id'}
+        )
+    """
+    def _create_mock(**kwargs):
+        mock = Mock()
+        mock.publish.return_value = kwargs.get(
+            'publish_response',
+            {'MessageId': 'test-message-id'}
+        )
+        return mock
+    return _create_mock
+
+
+@pytest.fixture
+def mock_cloudformation_client():
+    """Factory for creating mock CloudFormation clients.
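+
+    Optional response overrides: delete_stack_response and
+    describe_stacks_response (defaults shown in _create_mock below).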
+ + Example: + cfn = mock_cloudformation_client() + """ + def _create_mock(**kwargs): + mock = Mock() + mock.delete_stack.return_value = kwargs.get( + 'delete_stack_response', + {} + ) + mock.describe_stacks.return_value = kwargs.get( + 'describe_stacks_response', + {'Stacks': []} + ) + return mock + return _create_mock \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/integration/test_action_execution.py b/IaC/cdk/aws-resources-cleanup/tests/integration/test_action_execution.py new file mode 100644 index 0000000000..a68461b191 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/integration/test_action_execution.py @@ -0,0 +1,935 @@ +"""Integration tests for handler orchestration and action execution. + +Tests focus on the critical execution paths with AWS mocking: +- Action execution (execute_cleanup_action) +- Region cleanup orchestration (cleanup_region) +- Error handling +""" + +from __future__ import annotations +import datetime +import pytest +from unittest.mock import Mock, patch, MagicMock +from botocore.exceptions import ClientError + +from aws_resource_cleanup.models import CleanupAction +from aws_resource_cleanup.ec2.instances import execute_cleanup_action +from aws_resource_cleanup.handler import cleanup_region + + +@pytest.mark.integration +@pytest.mark.aws +class TestExecuteCleanupAction: + """Test action execution for all action types.""" + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) + def test_terminate_action_live_mode(self, mock_boto_client): + """ + GIVEN a TERMINATE action in live mode + WHEN execute_cleanup_action is called + THEN EC2 terminate_instances should be called + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + action = CleanupAction( + instance_id="i-test123", + region="us-east-1", + name="test-instance", + action="TERMINATE", + reason="TTL expired", + days_overdue=2.5, + billing_tag="test-tag", + ) + + result = execute_cleanup_action(action, "us-east-1") + + assert result is True + mock_boto_client.assert_called_once_with("ec2", region_name="us-east-1") + mock_ec2.terminate_instances.assert_called_once_with(InstanceIds=["i-test123"]) + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", True) + def test_terminate_action_dry_run_mode(self, mock_boto_client): + """ + GIVEN a TERMINATE action in DRY_RUN mode + WHEN execute_cleanup_action is called + THEN no AWS API calls should be made + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + action = CleanupAction( + instance_id="i-test456", + region="us-east-1", + name="test-instance", + action="TERMINATE", + reason="Untagged", + days_overdue=1.0, + ) + + result = execute_cleanup_action(action, "us-east-1") + + assert result is True + mock_boto_client.assert_called_once_with("ec2", region_name="us-east-1") + mock_ec2.terminate_instances.assert_not_called() + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) + def test_stop_action_live_mode(self, mock_boto_client): + """ + GIVEN a STOP action in live mode + WHEN execute_cleanup_action is called + THEN EC2 stop_instances should be called + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + action = CleanupAction( + instance_id="i-test789", + region="us-west-2", + name="test-instance", + action="STOP", + reason="stop-after-days policy", + days_overdue=1.0, + 
billing_tag="pmm-staging", + ) + + result = execute_cleanup_action(action, "us-west-2") + + assert result is True + mock_boto_client.assert_called_once_with("ec2", region_name="us-west-2") + mock_ec2.stop_instances.assert_called_once_with(InstanceIds=["i-test789"]) + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", True) + def test_stop_action_dry_run_mode(self, mock_boto_client): + """ + GIVEN a STOP action in DRY_RUN mode + WHEN execute_cleanup_action is called + THEN no AWS API calls should be made + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + action = CleanupAction( + instance_id="i-test101", + region="us-west-2", + name="test-instance", + action="STOP", + reason="stop-after-days", + days_overdue=0.5, + ) + + result = execute_cleanup_action(action, "us-west-2") + + assert result is True + mock_ec2.stop_instances.assert_not_called() + + @patch("aws_resource_cleanup.ec2.instances.delete_eks_cluster_stack") + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) + def test_terminate_cluster_action_live_mode( + self, mock_boto_client, mock_delete_stack + ): + """ + GIVEN a TERMINATE_CLUSTER action for EKS in live mode + WHEN execute_cleanup_action is called + THEN CloudFormation stack deletion and instance termination should occur + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_delete_stack.return_value = True + + action = CleanupAction( + instance_id="i-eks123", + region="us-east-1", + name="eks-node", + action="TERMINATE_CLUSTER", + reason="TTL expired", + days_overdue=3.0, + billing_tag="eks", + cluster_name="test-eks-cluster", + ) + + result = execute_cleanup_action(action, "us-east-1") + + assert result is True + mock_delete_stack.assert_called_once_with("test-eks-cluster", "us-east-1") + mock_ec2.terminate_instances.assert_called_once_with(InstanceIds=["i-eks123"]) + + @patch("aws_resource_cleanup.ec2.instances.delete_eks_cluster_stack") + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", True) + def test_terminate_cluster_action_dry_run_mode( + self, mock_boto_client, mock_delete_stack + ): + """ + GIVEN a TERMINATE_CLUSTER action in DRY_RUN mode + WHEN execute_cleanup_action is called + THEN no actual deletions should occur + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + action = CleanupAction( + instance_id="i-eks456", + region="us-east-1", + name="eks-node", + action="TERMINATE_CLUSTER", + reason="TTL expired", + days_overdue=2.0, + cluster_name="test-eks-cluster", + ) + + result = execute_cleanup_action(action, "us-east-1") + + assert result is True + mock_delete_stack.assert_not_called() + mock_ec2.terminate_instances.assert_not_called() + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + def test_terminate_cluster_without_cluster_name_fails(self, mock_boto_client): + """ + GIVEN a TERMINATE_CLUSTER action without cluster_name + WHEN execute_cleanup_action is called + THEN it should return False (invalid action) + """ + action = CleanupAction( + instance_id="i-invalid", + region="us-east-1", + name="eks-node", + action="TERMINATE_CLUSTER", + reason="TTL expired", + days_overdue=1.0, + cluster_name=None, # Missing! 
+ ) + + result = execute_cleanup_action(action, "us-east-1") + + assert result is False + + @patch("aws_resource_cleanup.ec2.instances.destroy_openshift_cluster") + @patch("aws_resource_cleanup.ec2.instances.detect_openshift_infra_id") + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) + @patch("aws_resource_cleanup.models.config.OPENSHIFT_CLEANUP_ENABLED", True) + def test_terminate_openshift_cluster_action_live_mode( + self, mock_boto_client, mock_detect_infra, mock_destroy_cluster + ): + """ + GIVEN a TERMINATE_OPENSHIFT_CLUSTER action in live mode + WHEN execute_cleanup_action is called + THEN OpenShift cleanup and instance termination should occur + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_detect_infra.return_value = "openshift-infra-abc123" + + action = CleanupAction( + instance_id="i-openshift123", + region="us-east-2", + name="openshift-master", + action="TERMINATE_OPENSHIFT_CLUSTER", + reason="TTL expired", + days_overdue=4.0, + billing_tag="openshift", + cluster_name="test-openshift", + ) + + result = execute_cleanup_action(action, "us-east-2") + + assert result is True + mock_detect_infra.assert_called_once_with("test-openshift", "us-east-2") + mock_destroy_cluster.assert_called_once_with( + "test-openshift", "openshift-infra-abc123", "us-east-2" + ) + mock_ec2.terminate_instances.assert_called_once_with( + InstanceIds=["i-openshift123"] + ) + + @patch("aws_resource_cleanup.ec2.instances.detect_openshift_infra_id") + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", True) + @patch("aws_resource_cleanup.models.config.OPENSHIFT_CLEANUP_ENABLED", True) + def test_terminate_openshift_cluster_action_dry_run_mode( + self, mock_boto_client, mock_detect_infra + ): + """ + GIVEN a TERMINATE_OPENSHIFT_CLUSTER action in DRY_RUN mode + WHEN execute_cleanup_action is called + THEN no actual deletions should occur + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_detect_infra.return_value = "openshift-infra-xyz" + + action = CleanupAction( + instance_id="i-openshift456", + region="us-east-2", + name="openshift-master", + action="TERMINATE_OPENSHIFT_CLUSTER", + reason="TTL expired", + days_overdue=3.5, + cluster_name="test-openshift", + ) + + result = execute_cleanup_action(action, "us-east-2") + + assert result is True + mock_detect_infra.assert_called_once_with("test-openshift", "us-east-2") + mock_ec2.terminate_instances.assert_not_called() + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + def test_terminate_openshift_without_cluster_name_fails(self, mock_boto_client): + """ + GIVEN a TERMINATE_OPENSHIFT_CLUSTER action without cluster_name + WHEN execute_cleanup_action is called + THEN it should return False + """ + action = CleanupAction( + instance_id="i-invalid", + region="us-east-2", + name="openshift-master", + action="TERMINATE_OPENSHIFT_CLUSTER", + reason="TTL expired", + days_overdue=1.0, + cluster_name=None, + ) + + result = execute_cleanup_action(action, "us-east-2") + + assert result is False + + @patch("aws_resource_cleanup.ec2.instances.boto3.client") + @patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) + def test_execute_action_handles_client_error(self, mock_boto_client): + """ + GIVEN a valid action that triggers ClientError + WHEN execute_cleanup_action is called + THEN it should catch the error and return False + """ + mock_ec2 = Mock() + 
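+        # Simulate the EC2 API rejecting the call, as it would for a
+        # nonexistent instance ID.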
mock_ec2.terminate_instances.side_effect = ClientError( + {"Error": {"Code": "InvalidInstanceID.NotFound", "Message": "Not found"}}, + "TerminateInstances", + ) + mock_boto_client.return_value = mock_ec2 + + action = CleanupAction( + instance_id="i-notfound", + region="us-east-1", + name="test", + action="TERMINATE", + reason="Test", + days_overdue=1.0, + ) + + result = execute_cleanup_action(action, "us-east-1") + + assert result is False + + +@pytest.mark.integration +@pytest.mark.aws +class TestCleanupRegion: + """Test region cleanup orchestration.""" + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_cleanup_region_happy_path( + self, mock_boto_client, mock_send_notification, mock_execute + ): + """ + GIVEN a region with instances that match cleanup policies + WHEN cleanup_region is called + THEN actions should be created and executed + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + # Mock EC2 response with instances + now = datetime.datetime.now(datetime.timezone.utc) + old_time = now - datetime.timedelta(days=35) + + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-stopped-old", + "State": {"Name": "stopped"}, + "LaunchTime": old_time, + "Tags": [{"Key": "Name", "Value": "long-stopped-instance"}], + } + ] + } + ] + } + + mock_execute.return_value = True + + actions = cleanup_region("us-east-1") + + assert len(actions) == 1 + assert actions[0].instance_id == "i-stopped-old" + assert actions[0].action == "TERMINATE" + assert actions[0].region == "us-east-1" + + mock_execute.assert_called_once() + mock_send_notification.assert_called_once() + + @patch("aws_resource_cleanup.handler.boto3.client") + def test_cleanup_region_no_instances(self, mock_boto_client): + """ + GIVEN a region with no instances + WHEN cleanup_region is called + THEN no actions should be created + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_instances.return_value = {"Reservations": []} + + actions = cleanup_region("us-west-2") + + assert len(actions) == 0 + + @patch("aws_resource_cleanup.handler.boto3.client") + def test_cleanup_region_protected_instances_skipped(self, mock_boto_client): + """ + GIVEN a region with only protected instances + WHEN cleanup_region is called + THEN no actions should be created + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + now = datetime.datetime.now(datetime.timezone.utc) + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-protected", + "State": {"Name": "running"}, + "LaunchTime": now, + "Tags": [ + {"Key": "Name", "Value": "protected-instance"}, + {"Key": "iit-billing-tag", "Value": "jenkins-cloud"}, + ], + } + ] + } + ] + } + + actions = cleanup_region("us-east-1") + + assert len(actions) == 0 + + @patch("aws_resource_cleanup.handler.boto3.client") + def test_cleanup_region_handles_exceptions(self, mock_boto_client): + """ + GIVEN an EC2 API error during describe_instances + WHEN cleanup_region is called + THEN it should handle the error gracefully and return empty list + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_instances.side_effect = ClientError( + {"Error": {"Code": "RequestLimitExceeded", "Message": "Rate limit"}}, + "DescribeInstances", + ) + + actions = cleanup_region("us-east-1") + + 
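+        # cleanup_region is expected to swallow the ClientError and return
+        # an empty action list rather than propagate it.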
assert actions == [] + + +@pytest.mark.integration +@pytest.mark.aws +@patch("aws_resource_cleanup.ec2.instances.boto3.client") +def test_unknown_action_returns_false(mock_boto_client): + from aws_resource_cleanup.models import CleanupAction + from aws_resource_cleanup.ec2.instances import execute_cleanup_action + + action = CleanupAction( + instance_id="i-unknown", + region="us-east-1", + name="test-instance", + action="PAUSE", + reason="Invalid", + days_overdue=0.0, + ) + + result = execute_cleanup_action(action, "us-east-1") + assert result is False + ec2 = mock_boto_client.return_value + ec2.terminate_instances.assert_not_called() + ec2.stop_instances.assert_not_called() + + +@pytest.mark.integration +@pytest.mark.aws +@pytest.mark.openshift +@patch("aws_resource_cleanup.ec2.instances.boto3.client") +@patch("aws_resource_cleanup.models.config.OPENSHIFT_CLEANUP_ENABLED", False) +@patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) +def test_terminate_openshift_cleanup_disabled_returns_true_no_calls(mock_boto_client): + from aws_resource_cleanup.models import CleanupAction + from aws_resource_cleanup.ec2.instances import execute_cleanup_action + + action = CleanupAction( + instance_id="i-openshift-disabled", + region="us-east-2", + name="openshift-master", + action="TERMINATE_OPENSHIFT_CLUSTER", + reason="TTL expired", + days_overdue=1.0, + cluster_name="test-openshift", + ) + + result = execute_cleanup_action(action, "us-east-2") + assert result is True + ec2 = mock_boto_client.return_value + ec2.terminate_instances.assert_not_called() + + +@pytest.mark.integration +@pytest.mark.aws +@pytest.mark.openshift +@patch("aws_resource_cleanup.ec2.instances.destroy_openshift_cluster") +@patch("aws_resource_cleanup.ec2.instances.detect_openshift_infra_id") +@patch("aws_resource_cleanup.ec2.instances.boto3.client") +@patch("aws_resource_cleanup.ec2.instances.DRY_RUN", False) +@patch("aws_resource_cleanup.models.config.OPENSHIFT_CLEANUP_ENABLED", True) +def test_terminate_openshift_infra_missing_still_terminates_instance( + mock_boto_client, mock_detect_infra, mock_destroy_cluster +): + from aws_resource_cleanup.models import CleanupAction + from aws_resource_cleanup.ec2.instances import execute_cleanup_action + + mock_detect_infra.return_value = None + action = CleanupAction( + instance_id="i-openshift-novpc", + region="us-east-2", + name="openshift-master", + action="TERMINATE_OPENSHIFT_CLUSTER", + reason="TTL expired", + days_overdue=2.0, + billing_tag="openshift", + cluster_name="test-openshift", + ) + + result = execute_cleanup_action(action, "us-east-2") + assert result is True + mock_detect_infra.assert_called_once_with("test-openshift", "us-east-2") + mock_destroy_cluster.assert_not_called() + ec2 = mock_boto_client.return_value + ec2.terminate_instances.assert_called_once_with(InstanceIds=["i-openshift-novpc"]) + + +@pytest.mark.integration +@pytest.mark.aws +@patch("aws_resource_cleanup.ec2.instances.boto3.resource") +def test_cirrus_ci_adds_billing_tag_when_missing(mock_boto_resource): + from aws_resource_cleanup.ec2.instances import cirrus_ci_add_iit_billing_tag + + mock_ec2_resource = MagicMock() + mock_instance = MagicMock() + mock_ec2_resource.Instance.return_value = mock_instance + mock_boto_resource.return_value = mock_ec2_resource + + instance = { + "InstanceId": "i-ccc123", + "Placement": {"AvailabilityZone": "us-east-1a"}, + } + tags_dict = {"CIRRUS_CI": "true", "Name": "ci-runner"} + + cirrus_ci_add_iit_billing_tag(instance, tags_dict) + + 
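+    # Region is expected to be derived from the availability zone
+    # ("us-east-1a" -> "us-east-1") when tagging.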
mock_boto_resource.assert_called_once_with("ec2", region_name="us-east-1") + mock_ec2_resource.Instance.assert_called_once_with("i-ccc123") + mock_instance.create_tags.assert_called_once() + assert { + "Key": "iit-billing-tag", + "Value": "CirrusCI", + } in mock_instance.create_tags.call_args.kwargs["Tags"] + + +@pytest.mark.integration +@pytest.mark.aws +@patch("aws_resource_cleanup.ec2.instances.boto3.resource") +def test_cirrus_ci_noop_if_tag_already_present(mock_boto_resource): + from aws_resource_cleanup.ec2.instances import cirrus_ci_add_iit_billing_tag + + instance = { + "InstanceId": "i-ccc456", + "Placement": {"AvailabilityZone": "us-east-1a"}, + } + tags_dict = {"CIRRUS_CI": "true", "iit-billing-tag": "CirrusCI"} + + cirrus_ci_add_iit_billing_tag(instance, tags_dict) + + mock_boto_resource.assert_not_called() + + +@pytest.mark.integration +@pytest.mark.aws +@patch("aws_resource_cleanup.handler.boto3.client") +def test_send_notification_publishes_when_topic_set(mock_boto_client): + from aws_resource_cleanup.handler import send_notification + from aws_resource_cleanup.models import CleanupAction + import aws_resource_cleanup.handler as handler_mod + + mock_sns = Mock() + mock_boto_client.return_value = mock_sns + + actions = [ + CleanupAction( + instance_id="i-1", + region="us-east-1", + name="one", + action="TERMINATE", + reason="test", + days_overdue=1.2, + billing_tag="x", + ) + ] + + original_topic = handler_mod.SNS_TOPIC_ARN + try: + handler_mod.SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:Topic" + send_notification(actions, "us-east-1") + finally: + handler_mod.SNS_TOPIC_ARN = original_topic + + assert mock_sns.publish.called + kwargs = mock_sns.publish.call_args.kwargs + assert len(kwargs["Subject"]) <= 100 + assert "Action:" in kwargs["Message"] + + +@pytest.mark.integration +@pytest.mark.aws +@patch("aws_resource_cleanup.handler.boto3.client") +def test_send_notification_skips_when_no_topic(mock_boto_client): + from aws_resource_cleanup.handler import send_notification + import aws_resource_cleanup.handler as handler_mod + + actions = [] + original_topic = handler_mod.SNS_TOPIC_ARN + try: + handler_mod.SNS_TOPIC_ARN = "" + send_notification(actions, "us-east-1") + finally: + handler_mod.SNS_TOPIC_ARN = original_topic + + mock_boto_client.assert_not_called() + + +# ===== Policy Priority Integration Tests ===== + + +@pytest.mark.integration +@pytest.mark.aws +@pytest.mark.policies +class TestPolicyPriorityInOrchestration: + """Test that cleanup_region respects policy priority order.""" + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_ttl_policy_takes_priority_over_untagged( + self, mock_boto_client, mock_send_notification, mock_execute + ): + """ + GIVEN instance with expired TTL AND no billing tag (matches both TTL and untagged) + WHEN cleanup_region is called + THEN TTL policy should be applied (not untagged) + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + now = datetime.datetime.now(datetime.timezone.utc) + two_hours_ago = now - datetime.timedelta(hours=2) + + # Instance with expired TTL (created 2 hours ago, TTL 1 hour) and no billing tag + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-ttl-untagged", + "State": {"Name": "running"}, + "LaunchTime": two_hours_ago, + "Tags": [ + {"Key": "Name", "Value": "test-instance"}, + { + "Key": 
"creation-time", + "Value": str(int((two_hours_ago).timestamp())), + }, + {"Key": "delete-cluster-after-hours", "Value": "1"}, + # No iit-billing-tag + ], + } + ] + } + ] + } + + mock_execute.return_value = True + + actions = cleanup_region("us-east-1") + + assert len(actions) == 1 + action = actions[0] + + # Should use TTL policy, not untagged + assert "TTL" in action.reason or "expired" in action.reason.lower() + assert "untagged" not in action.reason.lower() + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_stop_after_days_takes_priority_over_long_stopped( + self, mock_boto_client, mock_send_notification, mock_execute + ): + """ + GIVEN running instance with stop-after-days expired + WHEN cleanup_region is called + THEN STOP action should be created (not long-stopped which doesn't apply) + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + now = datetime.datetime.now(datetime.timezone.utc) + eight_days_ago = now - datetime.timedelta(days=8) + + # Running instance for 8 days with stop-after-days=7 + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-stop-after", + "State": {"Name": "running"}, + "LaunchTime": eight_days_ago, + "Tags": [ + {"Key": "Name", "Value": "pmm-staging"}, + {"Key": "iit-billing-tag", "Value": "pmm-staging"}, + {"Key": "stop-after-days", "Value": "7"}, + ], + } + ] + } + ] + } + + mock_execute.return_value = True + + actions = cleanup_region("us-east-1") + + assert len(actions) == 1 + action = actions[0] + + # Should be STOP action (not TERMINATE from long-stopped) + assert action.action == "STOP" + assert "stop" in action.reason.lower() + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_long_stopped_takes_priority_over_untagged( + self, mock_boto_client, mock_send_notification, mock_execute + ): + """ + GIVEN stopped instance for 35 days without billing tag + WHEN cleanup_region is called + THEN long-stopped policy should be applied (not untagged) + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + now = datetime.datetime.now(datetime.timezone.utc) + thirty_five_days_ago = now - datetime.timedelta(days=35) + + # Stopped for 35 days, no billing tag + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-long-stopped", + "State": {"Name": "stopped"}, + "LaunchTime": thirty_five_days_ago, + "Tags": [ + {"Key": "Name", "Value": "old-stopped"}, + # No billing tag + ], + } + ] + } + ] + } + + mock_execute.return_value = True + + actions = cleanup_region("us-east-1") + + assert len(actions) == 1 + action = actions[0] + + # Should use long-stopped policy + assert "stopped" in action.reason.lower() + assert "30 days" in action.reason or "long" in action.reason.lower() + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_multiple_instances_with_different_policies( + self, mock_boto_client, mock_send_notification, mock_execute + ): + """ + GIVEN multiple instances matching different policies + WHEN cleanup_region is called + THEN each instance should get correct policy based on priority + """ + mock_ec2 = Mock() + 
mock_boto_client.return_value = mock_ec2 + + now = datetime.datetime.now(datetime.timezone.utc) + + # Three instances with different policy matches + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + # Instance 1: TTL expired + { + "InstanceId": "i-ttl", + "State": {"Name": "running"}, + "LaunchTime": now - datetime.timedelta(hours=2), + "Tags": [ + {"Key": "Name", "Value": "ttl-instance"}, + { + "Key": "creation-time", + "Value": str( + int( + ( + now - datetime.timedelta(hours=2) + ).timestamp() + ) + ), + }, + {"Key": "delete-cluster-after-hours", "Value": "1"}, + {"Key": "iit-billing-tag", "Value": "test"}, + ], + }, + # Instance 2: Long stopped (no billing tag, stopped for 35 days) + { + "InstanceId": "i-stopped", + "State": {"Name": "stopped"}, + "LaunchTime": now - datetime.timedelta(days=35), + "Tags": [ + {"Key": "Name", "Value": "long-stopped"}, + # No billing tag - will trigger long-stopped policy + ], + }, + # Instance 3: Untagged + { + "InstanceId": "i-untagged", + "State": {"Name": "running"}, + "LaunchTime": now - datetime.timedelta(hours=2), + "Tags": [{"Key": "Name", "Value": "untagged"}], + }, + ] + } + ] + } + + mock_execute.return_value = True + + actions = cleanup_region("us-east-1") + + assert len(actions) == 3 + + # Verify each action has correct policy applied + actions_by_id = {action.instance_id: action for action in actions} + + # TTL instance should have TTL reason + ttl_action = actions_by_id["i-ttl"] + assert "TTL" in ttl_action.reason or "expired" in ttl_action.reason.lower() + + # Stopped instance should have long-stopped reason + stopped_action = actions_by_id["i-stopped"] + assert "stopped" in stopped_action.reason.lower() + + # Untagged instance should have missing billing tag reason + untagged_action = actions_by_id["i-untagged"] + assert ( + "missing" in untagged_action.reason.lower() + or "billing tag" in untagged_action.reason.lower() + ) + + @patch("aws_resource_cleanup.handler.execute_cleanup_action") + @patch("aws_resource_cleanup.handler.send_notification") + @patch("aws_resource_cleanup.handler.boto3.client") + def test_reordered_policies_would_fail_this_test( + self, mock_boto_client, mock_send_notification, mock_execute + ): + """ + GIVEN instance matching both TTL and untagged policies + WHEN cleanup_region is called + THEN if policies were reordered, this test would catch it + + This test documents that policy order matters and must be maintained. 
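+
+        Documented priority: TTL > stop-after-days > long-stopped > untagged.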
+ """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + now = datetime.datetime.now(datetime.timezone.utc) + two_hours_ago = now - datetime.timedelta(hours=2) + + # Instance with expired TTL and no billing tag + mock_ec2.describe_instances.return_value = { + "Reservations": [ + { + "Instances": [ + { + "InstanceId": "i-dual-match", + "State": {"Name": "running"}, + "LaunchTime": two_hours_ago, + "Tags": [ + {"Key": "Name", "Value": "test"}, + { + "Key": "creation-time", + "Value": str(int(two_hours_ago.timestamp())), + }, + {"Key": "delete-cluster-after-hours", "Value": "1"}, + # No billing tag - matches untagged policy too + ], + } + ] + } + ] + } + + mock_execute.return_value = True + + actions = cleanup_region("us-east-1") + + # The action MUST be from TTL policy (first in priority) + # If someone reorders the policies in handler.py (lines 96-100), + # this test will fail, alerting them to the priority requirement + assert len(actions) == 1 + action = actions[0] + + # Explicit check: must NOT be untagged reason + assert "untagged" not in action.reason.lower(), ( + "TTL policy should take priority over untagged policy. " + "If this fails, check that check_ttl_expiration is called " + "before check_untagged in handler.py cleanup_region()" + ) diff --git a/IaC/cdk/aws-resources-cleanup/tests/integration/test_openshift_orchestration.py b/IaC/cdk/aws-resources-cleanup/tests/integration/test_openshift_orchestration.py new file mode 100644 index 0000000000..1854d201de --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/integration/test_openshift_orchestration.py @@ -0,0 +1,173 @@ +"""Integration tests for OpenShift orchestration. + +Tests the destroy_openshift_cluster orchestrator with mocked AWS clients. +Single-pass cleanup with EventBridge retry handling. 
+""" + +from __future__ import annotations +import pytest +from unittest.mock import Mock, patch + +from aws_resource_cleanup.openshift.orchestrator import destroy_openshift_cluster + + +@pytest.mark.integration +@pytest.mark.openshift +class TestDestroyOpenshiftCluster: + """Test OpenShift cluster destruction orchestration.""" + + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_s3_state") + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_route53_records") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_vpc") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_internet_gateway") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_route_tables") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_subnets") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_security_groups") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_vpc_endpoints") + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_network_interfaces") + @patch("aws_resource_cleanup.openshift.orchestrator.release_elastic_ips") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_nat_gateways") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_load_balancers") + @patch("aws_resource_cleanup.openshift.orchestrator.boto3.client") + @patch("aws_resource_cleanup.models.config.DRY_RUN", False) + def test_orchestrator_calls_functions_in_correct_order( + self, + mock_boto_client, + mock_delete_lbs, + mock_delete_nats, + mock_release_eips, + mock_cleanup_enis, + mock_delete_endpoints, + mock_delete_sgs, + mock_delete_subnets, + mock_delete_rts, + mock_delete_igw, + mock_delete_vpc, + mock_cleanup_route53, + mock_cleanup_s3, + ): + """ + GIVEN OpenShift cluster exists and VPC can be deleted + WHEN destroy_openshift_cluster is called + THEN resources should be deleted in dependency order in single pass + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + # VPC exists and can be deleted + mock_ec2.describe_vpcs.return_value = {"Vpcs": [{"VpcId": "vpc-abc123"}]} + mock_delete_vpc.return_value = True # VPC successfully deleted + + result = destroy_openshift_cluster("test-cluster", "test-infra-123", "us-east-1") + + # Verify single-pass cleanup + assert result is True + + # Verify cleanup functions called once in correct order + mock_delete_lbs.assert_called_once_with("test-infra-123", "us-east-1") + mock_delete_nats.assert_called_once_with("test-infra-123", "us-east-1") + mock_release_eips.assert_called_once_with("test-infra-123", "us-east-1") + mock_cleanup_enis.assert_called_once_with("vpc-abc123", "us-east-1") + mock_delete_endpoints.assert_called_once_with("vpc-abc123", "us-east-1") + mock_delete_sgs.assert_called_once_with("vpc-abc123", "us-east-1") + mock_delete_subnets.assert_called_once_with("vpc-abc123", "us-east-1") + mock_delete_rts.assert_called_once_with("vpc-abc123", "us-east-1") + mock_delete_igw.assert_called_once_with("vpc-abc123", "us-east-1") + mock_delete_vpc.assert_called_once_with("vpc-abc123", "us-east-1") + + # Route53 and S3 cleanup when VPC successfully deleted + mock_cleanup_route53.assert_called_once_with("test-cluster", "us-east-1") + mock_cleanup_s3.assert_called_once_with("test-cluster", "us-east-1") + + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_s3_state") + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_route53_records") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_vpc") + 
@patch("aws_resource_cleanup.openshift.orchestrator.delete_load_balancers") + @patch("aws_resource_cleanup.openshift.orchestrator.boto3.client") + def test_orchestrator_exits_early_when_vpc_not_found( + self, mock_boto_client, mock_delete_lbs, mock_delete_vpc, mock_cleanup_route53, mock_cleanup_s3 + ): + """ + GIVEN VPC does not exist + WHEN destroy_openshift_cluster is called + THEN cleanup should exit early and clean up Route53/S3 + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + mock_ec2.describe_vpcs.return_value = {"Vpcs": []} + + result = destroy_openshift_cluster("test-cluster", "test-infra-123", "us-east-1") + + # Should check VPC once and exit + mock_ec2.describe_vpcs.assert_called_once() + mock_delete_lbs.assert_not_called() + mock_delete_vpc.assert_not_called() + + # Should clean up Route53/S3 when VPC is already gone + mock_cleanup_route53.assert_called_once_with("test-cluster", "us-east-1") + mock_cleanup_s3.assert_called_once_with("test-cluster", "us-east-1") + + assert result is True # Cleanup complete + + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_s3_state") + @patch("aws_resource_cleanup.openshift.orchestrator.cleanup_route53_records") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_vpc") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_load_balancers") + @patch("aws_resource_cleanup.openshift.orchestrator.boto3.client") + def test_vpc_has_dependencies_returns_false( + self, + mock_boto_client, + mock_delete_lbs, + mock_delete_vpc, + mock_cleanup_route53, + mock_cleanup_s3, + ): + """ + GIVEN VPC still has dependencies and cannot be deleted + WHEN destroy_openshift_cluster is called + THEN cleanup should return False and Route53/S3 should NOT be cleaned + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + # VPC exists but has dependencies + mock_ec2.describe_vpcs.return_value = {"Vpcs": [{"VpcId": "vpc-abc123"}]} + mock_delete_vpc.return_value = False # VPC has dependencies + + result = destroy_openshift_cluster("test-cluster", "test-infra-123", "us-east-1") + + # Should return False (cleanup incomplete) + assert result is False + + # Route53 and S3 should NOT be cleaned when VPC deletion fails + mock_cleanup_route53.assert_not_called() + mock_cleanup_s3.assert_not_called() + + @patch("aws_resource_cleanup.openshift.orchestrator.delete_vpc") + @patch("aws_resource_cleanup.openshift.orchestrator.delete_load_balancers") + @patch("aws_resource_cleanup.openshift.orchestrator.boto3.client") + def test_orchestrator_handles_dependency_violations( + self, mock_boto_client, mock_delete_lbs, mock_delete_vpc + ): + """ + GIVEN DependencyViolation error occurs during cleanup + WHEN destroy_openshift_cluster is called + THEN should return False and rely on EventBridge retry + """ + from botocore.exceptions import ClientError + + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + # VPC exists + mock_ec2.describe_vpcs.return_value = {"Vpcs": [{"VpcId": "vpc-abc123"}]} + + # Simulate DependencyViolation error + error_response = {"Error": {"Code": "DependencyViolation"}} + mock_delete_lbs.side_effect = ClientError(error_response, "DeleteLoadBalancer") + + result = destroy_openshift_cluster("test-cluster", "test-infra-123", "us-east-1") + + # Should return False (dependencies remain) + assert result is False diff --git a/IaC/cdk/aws-resources-cleanup/tests/pytest.ini b/IaC/cdk/aws-resources-cleanup/tests/pytest.ini new file mode 100644 index 0000000000..e731cd7e4c --- /dev/null +++ 
b/IaC/cdk/aws-resources-cleanup/tests/pytest.ini @@ -0,0 +1,28 @@ +[pytest] +testpaths = . +python_files = test_*.py +python_classes = Test* +python_functions = test_* + +# Test markers +markers = + unit: Unit tests (fast, isolated business logic) + integration: Integration tests (component interactions) + e2e: End-to-end tests (full workflows) + aws: Tests interacting with AWS services + slow: Slow-running tests (>1s) + openshift: OpenShift-specific functionality + eks: EKS-specific functionality + policies: Cleanup policy tests + smoke: Critical path smoke tests + volumes: EBS volume cleanup tests + +# Output options +addopts = + -v + --strict-markers + --tb=short + --disable-warnings + +# Optional: Uncomment to enable coverage reporting +# addopts = --cov=aws_resource_cleanup --cov-report=term-missing --cov-report=html \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/__init__.py b/IaC/cdk/aws-resources-cleanup/tests/unit/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/conftest.py b/IaC/cdk/aws-resources-cleanup/tests/unit/conftest.py new file mode 100644 index 0000000000..2d8b5d5c09 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/conftest.py @@ -0,0 +1,9 @@ +"""Fixtures specific to unit tests.""" + +import pytest + + +@pytest.fixture(autouse=True) +def _mark_as_unit(request): + """Automatically mark all tests in unit/ as unit tests.""" + request.node.add_marker(pytest.mark.unit) \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/eks/test_eks_cloudformation.py b/IaC/cdk/aws-resources-cleanup/tests/unit/eks/test_eks_cloudformation.py new file mode 100644 index 0000000000..ac75ee303f --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/eks/test_eks_cloudformation.py @@ -0,0 +1,452 @@ +"""Unit tests for EKS CloudFormation stack operations.""" + +from __future__ import annotations +import pytest +from unittest.mock import Mock, patch +from botocore.exceptions import ClientError + +from aws_resource_cleanup.eks.cloudformation import ( + get_eks_cloudformation_billing_tag, + cleanup_failed_stack_resources, + delete_eks_cluster_stack, +) + + +@pytest.mark.unit +@pytest.mark.eks +class TestGetEksCloudformationBillingTag: + """Test CloudFormation stack billing tag retrieval.""" + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_retrieves_billing_tag_from_stack(self, mock_boto_client): + """ + GIVEN CloudFormation stack exists with iit-billing-tag + WHEN get_eks_cloudformation_billing_tag is called + THEN billing tag should be returned + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [ + { + "StackName": "eksctl-test-cluster-cluster", + "Tags": [ + {"Key": "iit-billing-tag", "Value": "eks-team"}, + {"Key": "Environment", "Value": "test"}, + ], + } + ] + } + + billing_tag = get_eks_cloudformation_billing_tag("test-cluster", "us-east-1") + + assert billing_tag == "eks-team" + mock_cfn.describe_stacks.assert_called_once_with( + StackName="eksctl-test-cluster-cluster" + ) + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_returns_none_when_stack_not_found(self, mock_boto_client): + """ + GIVEN CloudFormation stack does not exist + WHEN get_eks_cloudformation_billing_tag is called + THEN None should be returned + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.side_effect = 
ClientError( + {"Error": {"Code": "ValidationError", "Message": "Stack does not exist"}}, + "DescribeStacks", + ) + + billing_tag = get_eks_cloudformation_billing_tag("nonexistent", "us-east-1") + + assert billing_tag is None + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_returns_none_when_no_billing_tag(self, mock_boto_client): + """ + GIVEN CloudFormation stack exists without iit-billing-tag + WHEN get_eks_cloudformation_billing_tag is called + THEN None should be returned + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [ + { + "StackName": "eksctl-test-cluster-cluster", + "Tags": [{"Key": "Environment", "Value": "test"}], + } + ] + } + + billing_tag = get_eks_cloudformation_billing_tag("test-cluster", "us-east-1") + + assert billing_tag is None + + +@pytest.mark.unit +@pytest.mark.eks +class TestCleanupFailedStackResources: + """Test manual cleanup of failed stack resources.""" + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_cleans_up_security_group_ingress_rules(self, mock_boto_client): + """ + GIVEN stack with DELETE_FAILED security group ingress + WHEN cleanup_failed_stack_resources is called + THEN ingress rules should be revoked + """ + mock_cfn = Mock() + mock_ec2 = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "cloudformation": + return mock_cfn + elif service_name == "ec2": + return mock_ec2 + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_cfn.describe_stack_events.return_value = { + "StackEvents": [ + { + "LogicalResourceId": "SecurityGroupIngress", + "ResourceType": "AWS::EC2::SecurityGroupIngress", + "ResourceStatus": "DELETE_FAILED", + "PhysicalResourceId": "sg-abc123|tcp|443|0.0.0.0/0", + } + ] + } + + mock_ec2.describe_security_groups.return_value = { + "SecurityGroups": [ + { + "GroupId": "sg-abc123", + "IpPermissions": [ + { + "IpProtocol": "tcp", + "FromPort": 443, + "ToPort": 443, + "IpRanges": [{"CidrIp": "0.0.0.0/0"}], + } + ], + } + ] + } + + result = cleanup_failed_stack_resources("test-stack", "us-east-1") + + assert result is True + mock_ec2.revoke_security_group_ingress.assert_called_once() + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_disassociates_route_table(self, mock_boto_client): + """ + GIVEN stack with DELETE_FAILED route table association + WHEN cleanup_failed_stack_resources is called + THEN association should be removed + """ + mock_cfn = Mock() + mock_ec2 = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "cloudformation": + return mock_cfn + elif service_name == "ec2": + return mock_ec2 + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_cfn.describe_stack_events.return_value = { + "StackEvents": [ + { + "LogicalResourceId": "RouteTableAssociation", + "ResourceType": "AWS::EC2::SubnetRouteTableAssociation", + "ResourceStatus": "DELETE_FAILED", + "PhysicalResourceId": "rtbassoc-abc123", + } + ] + } + + result = cleanup_failed_stack_resources("test-stack", "us-east-1") + + assert result is True + mock_ec2.disassociate_route_table.assert_called_once_with( + AssociationId="rtbassoc-abc123" + ) + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_deletes_route(self, mock_boto_client): + """ + GIVEN stack with DELETE_FAILED route + WHEN cleanup_failed_stack_resources is called + THEN route should be deleted + """ + mock_cfn = Mock() + mock_ec2 = Mock() + + 
def client_factory(service_name, **kwargs): + if service_name == "cloudformation": + return mock_cfn + elif service_name == "ec2": + return mock_ec2 + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_cfn.describe_stack_events.return_value = { + "StackEvents": [ + { + "LogicalResourceId": "Route", + "ResourceType": "AWS::EC2::Route", + "ResourceStatus": "DELETE_FAILED", + "PhysicalResourceId": "rtb-abc123_10.0.0.0/16", + } + ] + } + + result = cleanup_failed_stack_resources("test-stack", "us-east-1") + + assert result is True + mock_ec2.delete_route.assert_called_once_with( + RouteTableId="rtb-abc123", DestinationCidrBlock="10.0.0.0/16" + ) + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_returns_true_when_no_failed_resources(self, mock_boto_client): + """ + GIVEN stack with no DELETE_FAILED resources + WHEN cleanup_failed_stack_resources is called + THEN True should be returned without cleanup attempts + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stack_events.return_value = { + "StackEvents": [ + { + "LogicalResourceId": "VPC", + "ResourceType": "AWS::EC2::VPC", + "ResourceStatus": "DELETE_COMPLETE", + } + ] + } + + result = cleanup_failed_stack_resources("test-stack", "us-east-1") + + assert result is True + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_handles_resource_not_found_gracefully(self, mock_boto_client): + """ + GIVEN failed resource that no longer exists + WHEN cleanup_failed_stack_resources is called + THEN error should be handled gracefully + """ + mock_cfn = Mock() + mock_ec2 = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "cloudformation": + return mock_cfn + elif service_name == "ec2": + return mock_ec2 + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_cfn.describe_stack_events.return_value = { + "StackEvents": [ + { + "LogicalResourceId": "SecurityGroupIngress", + "ResourceType": "AWS::EC2::SecurityGroupIngress", + "ResourceStatus": "DELETE_FAILED", + "PhysicalResourceId": "sg-nonexistent|tcp|443", + } + ] + } + + mock_ec2.describe_security_groups.side_effect = ClientError( + {"Error": {"Code": "InvalidGroup.NotFound"}}, "DescribeSecurityGroups" + ) + + # Should not raise exception + result = cleanup_failed_stack_resources("test-stack", "us-east-1") + + assert result is True + + +@pytest.mark.unit +@pytest.mark.eks +class TestDeleteEksClusterStack: + """Test EKS cluster CloudFormation stack deletion.""" + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + @patch("aws_resource_cleanup.eks.cloudformation.DRY_RUN", False) + def test_deletes_stack_in_create_complete_state(self, mock_boto_client): + """ + GIVEN stack in CREATE_COMPLETE state + WHEN delete_eks_cluster_stack is called in live mode + THEN stack deletion should be initiated + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [{"StackStatus": "CREATE_COMPLETE"}] + } + + result = delete_eks_cluster_stack("test-cluster", "us-east-1") + + assert result is True + mock_cfn.delete_stack.assert_called_once_with( + StackName="eksctl-test-cluster-cluster" + ) + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + @patch("aws_resource_cleanup.eks.cloudformation.DRY_RUN", True) + def test_skips_deletion_in_dry_run_mode(self, mock_boto_client): + """ + GIVEN stack exists + WHEN delete_eks_cluster_stack is called in DRY_RUN mode + THEN no 
deletion should occur + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [{"StackStatus": "CREATE_COMPLETE"}] + } + + result = delete_eks_cluster_stack("test-cluster", "us-east-1") + + assert result is True + mock_cfn.delete_stack.assert_not_called() + + @patch("aws_resource_cleanup.eks.cloudformation.cleanup_failed_stack_resources") + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + @patch("aws_resource_cleanup.eks.cloudformation.DRY_RUN", False) + def test_retries_deletion_for_delete_failed_stack( + self, mock_boto_client, mock_cleanup_resources + ): + """ + GIVEN stack in DELETE_FAILED state + WHEN delete_eks_cluster_stack is called + THEN failed resources should be cleaned up and deletion retried + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [{"StackStatus": "DELETE_FAILED"}] + } + mock_cleanup_resources.return_value = True + + result = delete_eks_cluster_stack("test-cluster", "us-east-1") + + assert result is True + mock_cleanup_resources.assert_called_once_with( + "eksctl-test-cluster-cluster", "us-east-1" + ) + mock_cfn.delete_stack.assert_called_once() + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_returns_true_for_already_deleting_stack(self, mock_boto_client): + """ + GIVEN stack in DELETE_IN_PROGRESS state + WHEN delete_eks_cluster_stack is called + THEN True should be returned without initiating new deletion + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [{"StackStatus": "DELETE_IN_PROGRESS"}] + } + + result = delete_eks_cluster_stack("test-cluster", "us-east-1") + + assert result is True + mock_cfn.delete_stack.assert_not_called() + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_returns_false_when_stack_not_found(self, mock_boto_client): + """ + GIVEN stack does not exist + WHEN delete_eks_cluster_stack is called + THEN False should be returned + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.side_effect = ClientError( + {"Error": {"Code": "ValidationError", "Message": "does not exist"}}, + "DescribeStacks", + ) + + result = delete_eks_cluster_stack("nonexistent", "us-east-1") + + assert result is False + mock_cfn.delete_stack.assert_not_called() + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_handles_unexpected_errors(self, mock_boto_client): + """ + GIVEN unexpected AWS API error + WHEN delete_eks_cluster_stack is called + THEN error should be handled and False returned + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.side_effect = Exception("Unexpected error") + + result = delete_eks_cluster_stack("test-cluster", "us-east-1") + + assert result is False + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_constructs_correct_stack_name(self, mock_boto_client): + """ + GIVEN cluster name + WHEN delete_eks_cluster_stack is called + THEN correct eksctl stack name should be used + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [{"StackStatus": "CREATE_COMPLETE"}] + } + + delete_eks_cluster_stack("my-eks-cluster", "us-west-2") + + # Should use eksctl naming convention + mock_cfn.describe_stacks.assert_called_with( + 
StackName="eksctl-my-eks-cluster-cluster" + ) + + @patch("aws_resource_cleanup.eks.cloudformation.boto3.client") + def test_handles_rollback_complete_state(self, mock_boto_client): + """ + GIVEN stack in ROLLBACK_COMPLETE state + WHEN delete_eks_cluster_stack is called + THEN deletion should proceed normally + """ + mock_cfn = Mock() + mock_boto_client.return_value = mock_cfn + + mock_cfn.describe_stacks.return_value = { + "Stacks": [{"StackStatus": "ROLLBACK_COMPLETE"}] + } + + result = delete_eks_cluster_stack("test-cluster", "us-east-1") + + assert result is True diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/__init__.py b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_detection.py b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_detection.py new file mode 100644 index 0000000000..ae3e6e688f --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_detection.py @@ -0,0 +1,154 @@ +"""Unit tests for OpenShift cluster detection.""" + +from __future__ import annotations +import pytest +from unittest.mock import Mock, patch + +from aws_resource_cleanup.openshift.detection import detect_openshift_infra_id + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDetectOpenshiftInfraId: + """Test OpenShift infrastructure ID detection.""" + + @patch("aws_resource_cleanup.openshift.detection.boto3.client") + def test_detects_infra_id_from_exact_match(self, mock_boto_client): + """ + GIVEN VPC with exact cluster name tag + WHEN detect_openshift_infra_id is called + THEN infrastructure ID should be extracted from tag + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + mock_ec2.describe_vpcs.return_value = { + "Vpcs": [ + { + "VpcId": "vpc-abc123", + "Tags": [ + { + "Key": "kubernetes.io/cluster/test-infra-abc123", + "Value": "owned", + }, + {"Key": "Name", "Value": "openshift-vpc"}, + ], + } + ] + } + + infra_id = detect_openshift_infra_id("test-infra-abc123", "us-east-1") + + assert infra_id == "test-infra-abc123" + mock_ec2.describe_vpcs.assert_called_once_with( + Filters=[ + { + "Name": "tag-key", + "Values": ["kubernetes.io/cluster/test-infra-abc123"], + } + ] + ) + + @patch("aws_resource_cleanup.openshift.detection.boto3.client") + def test_detects_infra_id_from_wildcard_match(self, mock_boto_client): + """ + GIVEN VPC with cluster name prefix (wildcard match needed) + WHEN detect_openshift_infra_id is called with cluster name + THEN infrastructure ID should be extracted from tag + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + # First call returns empty (exact match fails) + # Second call returns VPC (wildcard match succeeds) + mock_ec2.describe_vpcs.side_effect = [ + {"Vpcs": []}, + { + "Vpcs": [ + { + "VpcId": "vpc-def456", + "Tags": [ + { + "Key": "kubernetes.io/cluster/test-cluster-xyz789", + "Value": "owned", + } + ], + } + ] + }, + ] + + infra_id = detect_openshift_infra_id("test-cluster", "us-east-1") + + assert infra_id == "test-cluster-xyz789" + assert mock_ec2.describe_vpcs.call_count == 2 + + # First call: exact match + first_call = mock_ec2.describe_vpcs.call_args_list[0] + assert first_call.kwargs["Filters"][0]["Values"] == [ + "kubernetes.io/cluster/test-cluster" + ] + + # Second call: wildcard match + second_call = mock_ec2.describe_vpcs.call_args_list[1] + assert second_call.kwargs["Filters"][0]["Values"] == [ + 
"kubernetes.io/cluster/test-cluster-*" + ] + + @patch("aws_resource_cleanup.openshift.detection.boto3.client") + def test_returns_none_when_no_vpc_found(self, mock_boto_client): + """ + GIVEN no VPC exists with cluster tags + WHEN detect_openshift_infra_id is called + THEN None should be returned + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + mock_ec2.describe_vpcs.return_value = {"Vpcs": []} + + infra_id = detect_openshift_infra_id("nonexistent-cluster", "us-east-1") + + assert infra_id is None + + @patch("aws_resource_cleanup.openshift.detection.boto3.client") + def test_returns_none_when_vpc_has_no_cluster_tags(self, mock_boto_client): + """ + GIVEN VPC exists but has no kubernetes cluster tags + WHEN detect_openshift_infra_id is called + THEN None should be returned + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + mock_ec2.describe_vpcs.return_value = { + "Vpcs": [ + { + "VpcId": "vpc-abc123", + "Tags": [ + {"Key": "Name", "Value": "regular-vpc"}, + {"Key": "Environment", "Value": "test"}, + ], + } + ] + } + + infra_id = detect_openshift_infra_id("test-cluster", "us-east-1") + + assert infra_id is None + + @patch("aws_resource_cleanup.openshift.detection.boto3.client") + def test_handles_aws_api_exception(self, mock_boto_client): + """ + GIVEN AWS API raises exception + WHEN detect_openshift_infra_id is called + THEN exception should be handled and None returned + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + mock_ec2.describe_vpcs.side_effect = Exception("AWS API Error") + + infra_id = detect_openshift_infra_id("test-cluster", "us-east-1") + + assert infra_id is None diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_dns.py b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_dns.py new file mode 100644 index 0000000000..2b86c2eda6 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_dns.py @@ -0,0 +1,147 @@ +"""Unit tests for OpenShift Route53 DNS cleanup.""" + +from __future__ import annotations +import pytest +from unittest.mock import Mock, patch + +from aws_resource_cleanup.openshift.dns import cleanup_route53_records + + +@pytest.mark.unit +@pytest.mark.openshift +class TestCleanupRoute53Records: + """Test Route53 DNS record cleanup for OpenShift clusters.""" + + @patch("aws_resource_cleanup.openshift.dns.boto3.client") + @patch("aws_resource_cleanup.openshift.dns.OPENSHIFT_BASE_DOMAIN", "cd.percona.com") + @patch("aws_resource_cleanup.openshift.dns.DRY_RUN", False) + def test_deletes_cluster_dns_records_live_mode(self, mock_boto_client): + """ + GIVEN OpenShift cluster DNS records exist in Route53 + WHEN cleanup_route53_records is called in live mode + THEN matching DNS records should be deleted + """ + mock_route53 = Mock() + mock_boto_client.return_value = mock_route53 + + mock_route53.list_hosted_zones.return_value = { + "HostedZones": [{"Id": "/hostedzone/Z123456", "Name": "cd.percona.com."}] + } + + mock_route53.list_resource_record_sets.return_value = { + "ResourceRecordSets": [ + { + "Name": "api.test-cluster.cd.percona.com.", + "Type": "A", + "TTL": 300, + "ResourceRecords": [{"Value": "1.2.3.4"}], + }, + { + "Name": "*.apps.test-cluster.cd.percona.com.", + "Type": "A", + "TTL": 300, + "ResourceRecords": [{"Value": "5.6.7.8"}], + }, + { + "Name": "other.cd.percona.com.", + "Type": "A", + "TTL": 300, + "ResourceRecords": [{"Value": "9.10.11.12"}], + }, + ] + } + + cleanup_route53_records("test-cluster", "us-east-1") + + 
mock_route53.list_hosted_zones.assert_called_once() + mock_route53.list_resource_record_sets.assert_called_once_with( + HostedZoneId="Z123456" + ) + + # Should delete 2 records (api and apps) but not the other one + call_args = mock_route53.change_resource_record_sets.call_args + assert call_args is not None + changes = call_args.kwargs["ChangeBatch"]["Changes"] + assert len(changes) == 2 + assert all(change["Action"] == "DELETE" for change in changes) + + @patch("aws_resource_cleanup.openshift.dns.boto3.client") + @patch("aws_resource_cleanup.openshift.dns.OPENSHIFT_BASE_DOMAIN", "cd.percona.com") + @patch("aws_resource_cleanup.openshift.dns.DRY_RUN", True) + def test_skips_deletion_in_dry_run_mode(self, mock_boto_client): + """ + GIVEN OpenShift cluster DNS records exist + WHEN cleanup_route53_records is called in DRY_RUN mode + THEN no changes should be made + """ + mock_route53 = Mock() + mock_boto_client.return_value = mock_route53 + + mock_route53.list_hosted_zones.return_value = { + "HostedZones": [{"Id": "/hostedzone/Z123", "Name": "cd.percona.com."}] + } + + mock_route53.list_resource_record_sets.return_value = { + "ResourceRecordSets": [ + { + "Name": "api.test-cluster.cd.percona.com.", + "Type": "A", + "TTL": 300, + "ResourceRecords": [{"Value": "1.2.3.4"}], + } + ] + } + + cleanup_route53_records("test-cluster", "us-east-1") + + mock_route53.change_resource_record_sets.assert_not_called() + + @patch("aws_resource_cleanup.openshift.dns.boto3.client") + @patch("aws_resource_cleanup.openshift.dns.OPENSHIFT_BASE_DOMAIN", "cd.percona.com") + def test_handles_missing_hosted_zone(self, mock_boto_client): + """ + GIVEN hosted zone does not exist + WHEN cleanup_route53_records is called + THEN function should return without error + """ + mock_route53 = Mock() + mock_boto_client.return_value = mock_route53 + + mock_route53.list_hosted_zones.return_value = { + "HostedZones": [{"Id": "/hostedzone/Z999", "Name": "other-domain.com."}] + } + + cleanup_route53_records("test-cluster", "us-east-1") + + mock_route53.list_resource_record_sets.assert_not_called() + + @patch("aws_resource_cleanup.openshift.dns.boto3.client") + @patch("aws_resource_cleanup.openshift.dns.OPENSHIFT_BASE_DOMAIN", "cd.percona.com") + @patch("aws_resource_cleanup.openshift.dns.DRY_RUN", False) + def test_handles_no_matching_records(self, mock_boto_client): + """ + GIVEN no DNS records match the cluster name + WHEN cleanup_route53_records is called + THEN no changes should be made + """ + mock_route53 = Mock() + mock_boto_client.return_value = mock_route53 + + mock_route53.list_hosted_zones.return_value = { + "HostedZones": [{"Id": "/hostedzone/Z123", "Name": "cd.percona.com."}] + } + + mock_route53.list_resource_record_sets.return_value = { + "ResourceRecordSets": [ + { + "Name": "other.cd.percona.com.", + "Type": "A", + "TTL": 300, + "ResourceRecords": [{"Value": "1.2.3.4"}], + } + ] + } + + cleanup_route53_records("test-cluster", "us-east-1") + + mock_route53.change_resource_record_sets.assert_not_called() diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_logic.py b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_logic.py new file mode 100644 index 0000000000..bacb7712fb --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_logic.py @@ -0,0 +1 @@ +"""Unit tests for OpenShift cleanup orchestration logic.""" diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_network.py 
b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_network.py new file mode 100644 index 0000000000..97e446cb4a --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_network.py @@ -0,0 +1,403 @@ +"""Unit tests for OpenShift network cleanup functions. + +Tests the individual network resource deletion functions with mocked boto3 clients. +""" + +from __future__ import annotations +import pytest +from unittest.mock import Mock, patch +from botocore.exceptions import ClientError + +from aws_resource_cleanup.openshift.network import ( + delete_nat_gateways, + release_elastic_ips, + cleanup_network_interfaces, + delete_vpc_endpoints, + delete_security_groups, + delete_subnets, + delete_route_tables, + delete_internet_gateway, + delete_vpc, +) + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteNatGateways: + """Test NAT gateway deletion.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_deletes_nat_gateways_live_mode(self, mock_boto_client): + """ + GIVEN NAT gateways exist for OpenShift cluster + WHEN delete_nat_gateways is called in live mode + THEN delete_nat_gateway should be called for each NAT gateway + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_nat_gateways.return_value = { + "NatGateways": [ + {"NatGatewayId": "nat-abc123"}, + {"NatGatewayId": "nat-def456"}, + ] + } + + delete_nat_gateways("test-infra-123", "us-east-1") + + mock_ec2.describe_nat_gateways.assert_called_once_with( + Filters=[ + { + "Name": "tag:kubernetes.io/cluster/test-infra-123", + "Values": ["owned"], + }, + {"Name": "state", "Values": ["available", "pending"]}, + ] + ) + assert mock_ec2.delete_nat_gateway.call_count == 2 + mock_ec2.delete_nat_gateway.assert_any_call(NatGatewayId="nat-abc123") + mock_ec2.delete_nat_gateway.assert_any_call(NatGatewayId="nat-def456") + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", True) + def test_skips_deletion_in_dry_run_mode(self, mock_boto_client): + """ + GIVEN NAT gateways exist for OpenShift cluster + WHEN delete_nat_gateways is called in DRY_RUN mode + THEN no deletion should occur + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_nat_gateways.return_value = { + "NatGateways": [{"NatGatewayId": "nat-abc123"}] + } + + delete_nat_gateways("test-infra-123", "us-east-1") + + mock_ec2.describe_nat_gateways.assert_called_once() + mock_ec2.delete_nat_gateway.assert_not_called() + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + def test_handles_empty_nat_gateway_list(self, mock_boto_client): + """ + GIVEN no NAT gateways exist + WHEN delete_nat_gateways is called + THEN function should complete without errors + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_nat_gateways.return_value = {"NatGateways": []} + + delete_nat_gateways("test-infra-123", "us-east-1") + + mock_ec2.delete_nat_gateway.assert_not_called() + + +@pytest.mark.unit +@pytest.mark.openshift +class TestReleaseElasticIps: + """Test Elastic IP release.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_releases_elastic_ips_live_mode(self, mock_boto_client): + """ + GIVEN Elastic IPs exist for OpenShift cluster + WHEN release_elastic_ips is called in live mode + 
THEN release_address should be called for each EIP + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_addresses.return_value = { + "Addresses": [ + {"AllocationId": "eipalloc-abc123"}, + {"AllocationId": "eipalloc-def456"}, + ] + } + + release_elastic_ips("test-infra-123", "us-east-1") + + mock_ec2.describe_addresses.assert_called_once_with( + Filters=[ + { + "Name": "tag:kubernetes.io/cluster/test-infra-123", + "Values": ["owned"], + } + ] + ) + assert mock_ec2.release_address.call_count == 2 + mock_ec2.release_address.assert_any_call(AllocationId="eipalloc-abc123") + mock_ec2.release_address.assert_any_call(AllocationId="eipalloc-def456") + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_handles_client_error_gracefully(self, mock_boto_client): + """ + GIVEN an EIP that cannot be released (already released) + WHEN release_elastic_ips is called + THEN ClientError should be caught and function continues + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_addresses.return_value = { + "Addresses": [{"AllocationId": "eipalloc-abc123"}] + } + mock_ec2.release_address.side_effect = ClientError( + {"Error": {"Code": "InvalidAllocationID.NotFound"}}, "ReleaseAddress" + ) + + # Should not raise exception + release_elastic_ips("test-infra-123", "us-east-1") + + mock_ec2.release_address.assert_called_once() + + +@pytest.mark.unit +@pytest.mark.openshift +class TestCleanupNetworkInterfaces: + """Test network interface cleanup.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_deletes_available_enis(self, mock_boto_client): + """ + GIVEN available (orphaned) network interfaces in VPC + WHEN cleanup_network_interfaces is called in live mode + THEN delete_network_interface should be called for each ENI + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_network_interfaces.return_value = { + "NetworkInterfaces": [ + {"NetworkInterfaceId": "eni-abc123"}, + {"NetworkInterfaceId": "eni-def456"}, + ] + } + + cleanup_network_interfaces("vpc-123456", "us-east-1") + + mock_ec2.describe_network_interfaces.assert_called_once_with( + Filters=[ + {"Name": "vpc-id", "Values": ["vpc-123456"]}, + {"Name": "status", "Values": ["available"]}, + ] + ) + assert mock_ec2.delete_network_interface.call_count == 2 + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteVpcEndpoints: + """Test VPC endpoint deletion.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_deletes_vpc_endpoints(self, mock_boto_client): + """ + GIVEN VPC endpoints exist in VPC + WHEN delete_vpc_endpoints is called in live mode + THEN delete_vpc_endpoints should be called for each endpoint + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_vpc_endpoints.return_value = { + "VpcEndpoints": [ + {"VpcEndpointId": "vpce-abc123"}, + {"VpcEndpointId": "vpce-def456"}, + ] + } + + delete_vpc_endpoints("vpc-123456", "us-east-1") + + assert mock_ec2.delete_vpc_endpoints.call_count == 2 + mock_ec2.delete_vpc_endpoints.assert_any_call(VpcEndpointIds=["vpce-abc123"]) + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteSecurityGroups: + """Test security group deletion with dependency handling.""" + + 
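+    # Assumed behaviour (sketch, inferred from the tests below): ingress rules
+    # are revoked before deletion to break cross-references between groups,
+    # and the VPC's "default" security group is never deleted.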
@patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_removes_ingress_rules_before_deletion(self, mock_boto_client): + """ + GIVEN security groups with ingress rules + WHEN delete_security_groups is called + THEN ingress rules should be revoked before deletion + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_security_groups.return_value = { + "SecurityGroups": [ + { + "GroupId": "sg-abc123", + "GroupName": "openshift-sg", + "IpPermissions": [ + { + "IpProtocol": "tcp", + "FromPort": 443, + "ToPort": 443, + "IpRanges": [{"CidrIp": "0.0.0.0/0"}], + } + ], + } + ] + } + + delete_security_groups("vpc-123456", "us-east-1") + + # Should revoke ingress rules first + mock_ec2.revoke_security_group_ingress.assert_called_once() + # Then delete the security group + mock_ec2.delete_security_group.assert_called_once_with(GroupId="sg-abc123") + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_skips_default_security_group(self, mock_boto_client): + """ + GIVEN a VPC with default security group + WHEN delete_security_groups is called + THEN default security group should not be deleted + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_security_groups.return_value = { + "SecurityGroups": [ + { + "GroupId": "sg-default", + "GroupName": "default", + "IpPermissions": [], + } + ] + } + + delete_security_groups("vpc-123456", "us-east-1") + + mock_ec2.delete_security_group.assert_not_called() + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteSubnets: + """Test subnet deletion.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_deletes_all_subnets(self, mock_boto_client): + """ + GIVEN subnets exist in VPC + WHEN delete_subnets is called in live mode + THEN delete_subnet should be called for each subnet + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_subnets.return_value = { + "Subnets": [ + {"SubnetId": "subnet-abc123"}, + {"SubnetId": "subnet-def456"}, + ] + } + + delete_subnets("vpc-123456", "us-east-1") + + assert mock_ec2.delete_subnet.call_count == 2 + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteRouteTables: + """Test route table deletion.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_skips_main_route_table(self, mock_boto_client): + """ + GIVEN route tables including main route table + WHEN delete_route_tables is called + THEN main route table should not be deleted + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_route_tables.return_value = { + "RouteTables": [ + { + "RouteTableId": "rtb-main", + "Associations": [{"Main": True}], + }, + { + "RouteTableId": "rtb-custom", + "Associations": [{"Main": False}], + }, + ] + } + + delete_route_tables("vpc-123456", "us-east-1") + + # Should only delete non-main route table + mock_ec2.delete_route_table.assert_called_once_with(RouteTableId="rtb-custom") + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteInternetGateway: + """Test internet gateway deletion.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def 
test_detaches_and_deletes_igw(self, mock_boto_client): + """ + GIVEN internet gateway attached to VPC + WHEN delete_internet_gateway is called + THEN IGW should be detached then deleted + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + mock_ec2.describe_internet_gateways.return_value = { + "InternetGateways": [{"InternetGatewayId": "igw-abc123"}] + } + + delete_internet_gateway("vpc-123456", "us-east-1") + + mock_ec2.detach_internet_gateway.assert_called_once_with( + InternetGatewayId="igw-abc123", VpcId="vpc-123456" + ) + mock_ec2.delete_internet_gateway.assert_called_once_with( + InternetGatewayId="igw-abc123" + ) + + +@pytest.mark.unit +@pytest.mark.openshift +class TestDeleteVpc: + """Test VPC deletion.""" + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", False) + def test_deletes_vpc_live_mode(self, mock_boto_client): + """ + GIVEN VPC exists + WHEN delete_vpc is called in live mode + THEN delete_vpc should be called + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + delete_vpc("vpc-123456", "us-east-1") + + mock_ec2.delete_vpc.assert_called_once_with(VpcId="vpc-123456") + + @patch("aws_resource_cleanup.openshift.network.boto3.client") + @patch("aws_resource_cleanup.openshift.network.DRY_RUN", True) + def test_skips_deletion_in_dry_run(self, mock_boto_client): + """ + GIVEN VPC exists + WHEN delete_vpc is called in DRY_RUN mode + THEN no deletion should occur + """ + mock_ec2 = Mock() + mock_boto_client.return_value = mock_ec2 + + delete_vpc("vpc-123456", "us-east-1") + + mock_ec2.delete_vpc.assert_not_called() diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_storage.py b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_storage.py new file mode 100644 index 0000000000..ad129e1cfb --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/openshift/test_openshift_storage.py @@ -0,0 +1,144 @@ +"""Unit tests for OpenShift S3 state storage cleanup.""" + +from __future__ import annotations +import pytest +from unittest.mock import Mock, patch +from botocore.exceptions import ClientError + +from aws_resource_cleanup.openshift.storage import cleanup_s3_state + + +@pytest.mark.unit +@pytest.mark.openshift +class TestCleanupS3State: + """Test S3 state bucket cleanup for OpenShift clusters.""" + + @patch("aws_resource_cleanup.openshift.storage.boto3.client") + @patch("aws_resource_cleanup.openshift.storage.DRY_RUN", False) + def test_deletes_s3_objects_live_mode(self, mock_boto_client): + """ + GIVEN S3 objects exist for OpenShift cluster + WHEN cleanup_s3_state is called in live mode + THEN all cluster objects should be deleted + """ + mock_s3 = Mock() + mock_sts = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "s3": + return mock_s3 + elif service_name == "sts": + return mock_sts + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_sts.get_caller_identity.return_value = {"Account": "123456789012"} + + mock_s3.list_objects_v2.return_value = { + "Contents": [ + {"Key": "test-cluster/terraform.tfstate"}, + {"Key": "test-cluster/metadata.json"}, + ] + } + + cleanup_s3_state("test-cluster", "us-east-1") + + expected_bucket = "openshift-clusters-123456789012-us-east-1" + mock_s3.list_objects_v2.assert_called_once_with( + Bucket=expected_bucket, Prefix="test-cluster/" + ) + + assert mock_s3.delete_object.call_count == 2 + mock_s3.delete_object.assert_any_call( + Bucket=expected_bucket, 
Key="test-cluster/terraform.tfstate" + ) + mock_s3.delete_object.assert_any_call( + Bucket=expected_bucket, Key="test-cluster/metadata.json" + ) + + @patch("aws_resource_cleanup.openshift.storage.boto3.client") + @patch("aws_resource_cleanup.openshift.storage.DRY_RUN", True) + def test_skips_deletion_in_dry_run_mode(self, mock_boto_client): + """ + GIVEN S3 objects exist for cluster + WHEN cleanup_s3_state is called in DRY_RUN mode + THEN no deletions should occur + """ + mock_s3 = Mock() + mock_sts = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "s3": + return mock_s3 + elif service_name == "sts": + return mock_sts + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_sts.get_caller_identity.return_value = {"Account": "123456789012"} + mock_s3.list_objects_v2.return_value = { + "Contents": [{"Key": "test-cluster/terraform.tfstate"}] + } + + cleanup_s3_state("test-cluster", "us-east-1") + + mock_s3.delete_object.assert_not_called() + + @patch("aws_resource_cleanup.openshift.storage.boto3.client") + @patch("aws_resource_cleanup.openshift.storage.DRY_RUN", False) + def test_handles_no_contents_in_bucket(self, mock_boto_client): + """ + GIVEN no S3 objects exist for cluster + WHEN cleanup_s3_state is called + THEN function should complete without errors + """ + mock_s3 = Mock() + mock_sts = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "s3": + return mock_s3 + elif service_name == "sts": + return mock_sts + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_sts.get_caller_identity.return_value = {"Account": "123456789012"} + mock_s3.list_objects_v2.return_value = {} # No Contents key + + cleanup_s3_state("test-cluster", "us-east-1") + + mock_s3.delete_object.assert_not_called() + + @patch("aws_resource_cleanup.openshift.storage.boto3.client") + @patch("aws_resource_cleanup.openshift.storage.DRY_RUN", False) + def test_handles_missing_bucket_gracefully(self, mock_boto_client): + """ + GIVEN S3 bucket does not exist + WHEN cleanup_s3_state is called + THEN NoSuchBucket error should be handled gracefully + """ + mock_s3 = Mock() + mock_sts = Mock() + + def client_factory(service_name, **kwargs): + if service_name == "s3": + return mock_s3 + elif service_name == "sts": + return mock_sts + return Mock() + + mock_boto_client.side_effect = client_factory + + mock_sts.get_caller_identity.return_value = {"Account": "123456789012"} + mock_s3.list_objects_v2.side_effect = ClientError( + {"Error": {"Code": "NoSuchBucket"}}, "ListObjectsV2" + ) + + # Should not raise exception + cleanup_s3_state("test-cluster", "us-east-1") + + mock_s3.delete_object.assert_not_called() diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/policies/__init__.py b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_long_stopped_policy.py b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_long_stopped_policy.py new file mode 100644 index 0000000000..74d96795ea --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_long_stopped_policy.py @@ -0,0 +1,94 @@ +"""Unit tests for long-stopped instance detection logic. + +Tests the check_long_stopped() policy function. 
+""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.ec2.policies import check_long_stopped + + +@pytest.mark.unit +@pytest.mark.policies +class TestLongStoppedPolicy: + """Test long-stopped instance detection logic.""" + + def test_instance_stopped_over_30_days_creates_terminate_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance stopped for more than 30 days + WHEN check_long_stopped is called + THEN a TERMINATE action should be returned + """ + instance = make_instance( + name="long-stopped", + state="stopped", + days_old=35, + billing_tag="test-billing" + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_long_stopped(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "TERMINATE" + assert action.name == "long-stopped" + # Should be overdue by ~5 days (35 - 30) + assert 4.9 < action.days_overdue < 5.1 + + def test_running_instance_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN a running instance (regardless of age) + WHEN check_long_stopped is called + THEN None should be returned + """ + instance = make_instance( + state="running", + days_old=35 + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_long_stopped(instance, tags_dict, current_time) + + assert action is None + + def test_recently_stopped_instance_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance stopped for less than 30 days + WHEN check_long_stopped is called + THEN None should be returned + """ + instance = make_instance( + state="stopped", + days_old=30 # Exactly 30 days + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_long_stopped(instance, tags_dict, current_time) + + assert action is None + + def test_stopped_instance_at_31_days_creates_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance stopped for exactly 31 days + WHEN check_long_stopped is called + THEN a TERMINATE action should be returned + """ + instance = make_instance( + state="stopped", + days_old=31 + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_long_stopped(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "TERMINATE" \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_stop_policy.py b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_stop_policy.py new file mode 100644 index 0000000000..53ab27ab04 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_stop_policy.py @@ -0,0 +1,113 @@ +"""Unit tests for stop-after-days policy logic. + +Tests the check_stop_after_days() policy function. 
+""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.ec2.policies import check_stop_after_days + + +@pytest.mark.unit +@pytest.mark.policies +class TestStopAfterDaysPolicy: + """Test stop-after-days policy logic.""" + + def test_running_instance_past_stop_threshold_creates_stop_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN a running instance past stop-after-days threshold + WHEN check_stop_after_days is called + THEN a STOP action should be returned + """ + instance = make_instance( + name="pmm-staging", + state="running", + days_old=8, + stop_after_days=7, + billing_tag="pmm-staging", + owner="test-user" + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_stop_after_days(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "STOP" + assert action.name == "pmm-staging" + assert action.billing_tag == "pmm-staging" + # Should be overdue by 1 day + assert 0.99 < action.days_overdue < 1.01 + + def test_stopped_instance_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN a stopped instance with stop-after-days tag + WHEN check_stop_after_days is called + THEN None should be returned (already stopped) + """ + instance = make_instance( + state="stopped", + days_old=8, + stop_after_days=7 + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_stop_after_days(instance, tags_dict, current_time) + + assert action is None + + def test_instance_without_stop_tag_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance without stop-after-days tag + WHEN check_stop_after_days is called + THEN None should be returned + """ + instance = make_instance(state="running") + tags_dict = tags_dict_from_instance(instance) + + action = check_stop_after_days(instance, tags_dict, current_time) + + assert action is None + + def test_instance_before_stop_threshold_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN a running instance not yet past stop threshold + WHEN check_stop_after_days is called + THEN None should be returned + """ + instance = make_instance( + state="running", + days_old=6, + stop_after_days=7 + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_stop_after_days(instance, tags_dict, current_time) + + assert action is None + + def test_invalid_stop_after_days_value_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with non-numeric stop-after-days value + WHEN check_stop_after_days is called + THEN None should be returned (graceful handling) + """ + instance = make_instance( + state="running", + **{"stop-after-days": "invalid"} + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_stop_after_days(instance, tags_dict, current_time) + + assert action is None \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_ttl_policy.py b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_ttl_policy.py new file mode 100644 index 0000000000..460c493e95 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_ttl_policy.py @@ -0,0 +1,252 @@ +"""Unit tests for TTL expiration policy logic. + +Tests the check_ttl_expiration() policy function. 
+""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.ec2.policies import check_ttl_expiration + + +@pytest.mark.unit +@pytest.mark.policies +class TestTTLExpirationDetection: + """Test TTL expiration detection logic.""" + + def test_instance_with_expired_ttl_creates_terminate_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with expired TTL (created 2h ago, TTL 1h) + WHEN check_ttl_expiration is called + THEN a TERMINATE action should be returned with correct days overdue + """ + instance = make_instance( + name="test-instance", + ttl_expired=True, + hours_old=2, + ttl_hours=1, + billing_tag="test-billing", + owner="test-user" + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "TERMINATE" + assert action.instance_id == "i-test123456" + assert action.name == "test-instance" + assert action.billing_tag == "test-billing" + assert action.owner == "test-user" + # 1 hour overdue = 3600 seconds = 0.0417 days + assert 0.04 < action.days_overdue < 0.05 + + def test_instance_with_valid_ttl_returns_none( + self, instance_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with TTL not yet expired + WHEN check_ttl_expiration is called + THEN None should be returned (no action) + """ + # Manually add TTL tags for valid (non-expired) TTL + # Use instance_builder directly from fixture + creation_time = current_time - 1800 # 30 minutes ago + instance = ( + instance_builder + .with_ttl_tags(creation_time, 1) # 1 hour TTL + .with_billing_tag("test-billing") + .build() + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is None + + def test_instance_without_ttl_tags_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance without TTL tags + WHEN check_ttl_expiration is called + THEN None should be returned + """ + instance = make_instance(billing_tag="test-billing") + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is None + + def test_instance_with_partial_ttl_tags_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with only creation-time but no TTL + WHEN check_ttl_expiration is called + THEN None should be returned + """ + instance = make_instance( + billing_tag="test-billing", + **{"creation-time": str(current_time)} + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is None + + def test_instance_with_invalid_ttl_values_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with non-numeric TTL tags + WHEN check_ttl_expiration is called + THEN None should be returned (graceful handling) + """ + instance = make_instance( + **{ + "creation-time": "invalid", + "delete-cluster-after-hours": "not-a-number" + } + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is None + + def test_days_overdue_calculation_accurate( + self, instance_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance expired by exactly 3 days minus 1 hour + WHEN check_ttl_expiration is called 
+ THEN days_overdue should be approximately 2.96 days + """ + # Created 3 days ago with 1 hour TTL = overdue by 2.958 days + creation_time = current_time - 259200 # 3 days ago + instance = ( + instance_builder + .with_ttl_tags(creation_time, 1) # 1 hour TTL + .with_billing_tag("test-billing") + .build() + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is not None + # Overdue = (3 days - 1 hour) = (259200 - 3600) / 86400 = 2.958 days + assert 2.95 < action.days_overdue < 2.97 + + +@pytest.mark.unit +@pytest.mark.policies +class TestTTLBoundaryConditions: + """Test TTL expiration at exact boundaries.""" + + @pytest.mark.parametrize( + "hours_offset,ttl_hours,should_expire", + [ + (1.001, 1, True), # Expired by ~1 second + (1.0, 1, True), # Exactly at expiration (>= comparison) + (0.999, 1, False), # Not yet expired + (24.001, 24, True), # 1 day TTL expired by ~1 second + (24.0, 24, True), # 1 day TTL exactly at expiration + ], + ) + def test_ttl_boundary_conditions( + self, + instance_builder, + current_time, + tags_dict_from_instance, + hours_offset, + ttl_hours, + should_expire, + ): + """ + GIVEN instances at TTL boundary conditions + WHEN check_ttl_expiration is called + THEN correct expiration decision should be made + """ + if should_expire: + creation_time = current_time - int(hours_offset * 3600) + else: + # Create with valid TTL (not yet expired) + creation_time = current_time - int(hours_offset * 3600) + 100 + + instance = ( + instance_builder + .with_ttl_tags(creation_time, ttl_hours) + .with_billing_tag("test-billing") + .build() + ) + + tags_dict = tags_dict_from_instance(instance) + action = check_ttl_expiration(instance, tags_dict, current_time) + + if should_expire: + assert action is not None + assert action.action == "TERMINATE" + else: + assert action is None + + +@pytest.mark.unit +@pytest.mark.policies +@pytest.mark.openshift +class TestTTLClusterHandling: + """Test TTL policy for clustered resources.""" + + def test_openshift_cluster_instance_gets_special_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an OpenShift cluster instance with expired TTL + WHEN check_ttl_expiration is called + THEN TERMINATE_OPENSHIFT_CLUSTER action should be returned + """ + instance = make_instance( + ttl_expired=True, + hours_old=2, + ttl_hours=1, + openshift=True, + infra_id="test-infra-123", + cluster_name="test-cluster" + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "TERMINATE_OPENSHIFT_CLUSTER" + # Cluster name is extracted from kubernetes.io/cluster/ tag, not from cluster-name tag + assert action.cluster_name == "test-infra-123" + + def test_eks_cluster_instance_gets_cluster_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an EKS cluster instance with expired TTL + WHEN check_ttl_expiration is called + THEN TERMINATE_CLUSTER action should be returned + """ + instance = make_instance( + ttl_expired=True, + hours_old=2, + ttl_hours=1, + eks=True, + eks_cluster="test-eks", + cluster_name="test-eks" + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_ttl_expiration(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "TERMINATE_CLUSTER" + assert action.cluster_name == "test-eks" \ No newline at end of file diff --git 
a/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_untagged_policy.py b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_untagged_policy.py new file mode 100644 index 0000000000..baebd416ed --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/policies/test_untagged_policy.py @@ -0,0 +1,113 @@ +"""Unit tests for untagged instance detection logic. + +Tests the check_untagged() policy function. +""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.ec2.policies import check_untagged + + +@pytest.mark.unit +@pytest.mark.policies +class TestUntaggedPolicy: + """Test untagged instance detection logic.""" + + def test_instance_without_billing_tag_over_threshold_creates_terminate_action( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance without billing tag running > 30 minutes + WHEN check_untagged is called + THEN a TERMINATE action should be returned + """ + instance = make_instance( + name="untagged-instance", + hours_old=1 + # No billing tag + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_untagged(instance, tags_dict, current_time) + + assert action is not None + assert action.action == "TERMINATE" + assert action.billing_tag == "" + assert "Missing billing tag" in action.reason + + def test_instance_with_valid_billing_tag_returns_none( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with valid billing tag + WHEN check_untagged is called + THEN None should be returned + """ + instance = make_instance( + hours_old=1, + billing_tag="pmm-staging" + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_untagged(instance, tags_dict, current_time) + + assert action is None + + def test_untagged_instance_under_threshold_returns_none( + self, instance_builder, current_time, tags_dict_from_instance, time_utils + ): + """ + GIVEN an instance without billing tag but under threshold (< 30 min) + WHEN check_untagged is called + THEN None should be returned (grace period) + """ + # Create instance with recent launch time + instance = ( + instance_builder + .with_launch_time(time_utils.seconds_ago(1200)) # 20 minutes ago + .build() + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_untagged(instance, tags_dict, current_time) + + assert action is None + + @pytest.mark.parametrize( + "minutes_running,should_terminate", + [ + (29, False), # Under threshold (29 < 30) + (30, True), # At threshold + (31, True), # Over threshold + (60, True), # Well over threshold + ], + ) + def test_untagged_threshold_boundary( + self, + instance_builder, + current_time, + tags_dict_from_instance, + time_utils, + minutes_running, + should_terminate, + ): + """ + GIVEN untagged instances at various runtime thresholds + WHEN check_untagged is called + THEN correct termination decision should be made + """ + instance = ( + instance_builder + .with_launch_time(time_utils.seconds_ago(minutes_running * 60)) + .build() + ) + tags_dict = tags_dict_from_instance(instance) + + action = check_untagged(instance, tags_dict, current_time) + + if should_terminate: + assert action is not None + assert action.action == "TERMINATE" + else: + assert action is None \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/test_billing_validation.py b/IaC/cdk/aws-resources-cleanup/tests/unit/test_billing_validation.py new file mode 100644 index 0000000000..e36a68989f --- /dev/null +++ 
b/IaC/cdk/aws-resources-cleanup/tests/unit/test_billing_validation.py @@ -0,0 +1,89 @@ +"""Unit tests for billing tag validation logic. + +Tests the has_valid_billing_tag() function. +""" + +from __future__ import annotations +import datetime +import pytest + +from aws_resource_cleanup.utils import has_valid_billing_tag + + +@pytest.mark.unit +class TestBillingTagValidation: + """Test billing tag validation rules.""" + + def test_valid_category_billing_tag_accepted(self): + """ + GIVEN a category-based billing tag (e.g., "pmm-staging") + WHEN has_valid_billing_tag is called + THEN True should be returned + """ + tags_dict = {"iit-billing-tag": "pmm-staging"} + assert has_valid_billing_tag(tags_dict) is True + + def test_future_timestamp_billing_tag_accepted(self): + """ + GIVEN a future Unix timestamp billing tag + WHEN has_valid_billing_tag is called + THEN True should be returned + """ + # Future timestamp (year 2030) + future_timestamp = str(int(datetime.datetime(2030, 1, 1).timestamp())) + tags_dict = {"iit-billing-tag": future_timestamp} + assert has_valid_billing_tag(tags_dict) is True + + def test_expired_timestamp_billing_tag_rejected(self): + """ + GIVEN an expired Unix timestamp billing tag + WHEN has_valid_billing_tag is called + THEN False should be returned + """ + # Past timestamp (year 2020) + past_timestamp = str(int(datetime.datetime(2020, 1, 1).timestamp())) + tags_dict = {"iit-billing-tag": past_timestamp} + assert has_valid_billing_tag(tags_dict) is False + + def test_missing_billing_tag_rejected(self): + """ + GIVEN tags without iit-billing-tag + WHEN has_valid_billing_tag is called + THEN False should be returned + """ + tags_dict = {"Name": "test-instance"} + assert has_valid_billing_tag(tags_dict) is False + + def test_empty_billing_tag_rejected(self): + """ + GIVEN an empty billing tag value + WHEN has_valid_billing_tag is called + THEN False should be returned + """ + tags_dict = {"iit-billing-tag": ""} + assert has_valid_billing_tag(tags_dict) is False + + +@pytest.mark.unit +@pytest.mark.parametrize( + "billing_tag", + [ + "pmm-staging", + "CirrusCI", + "eks", + "openshift", + "test-team", + "custom-123", + ], +) +class TestVariousCategoryTags: + """Test various category-based billing tags.""" + + def test_various_category_tags_accepted(self, billing_tag): + """ + GIVEN various category-based billing tags + WHEN has_valid_billing_tag is called + THEN True should be returned for all + """ + tags_dict = {"iit-billing-tag": billing_tag} + assert has_valid_billing_tag(tags_dict) is True \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/test_cluster_detection.py b/IaC/cdk/aws-resources-cleanup/tests/unit/test_cluster_detection.py new file mode 100644 index 0000000000..9a68fb67e2 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/test_cluster_detection.py @@ -0,0 +1,79 @@ +"""Unit tests for cluster name extraction logic. + +Tests the extract_cluster_name() function. 
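+The kubernetes.io/cluster/<name> key takes precedence over aws:eks:cluster-name;
+see the precedence test below.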
+""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.utils import extract_cluster_name + + +@pytest.mark.unit +@pytest.mark.aws +class TestClusterNameExtraction: + """Test cluster name extraction from various tag formats.""" + + def test_extract_kubernetes_cluster_name_from_tag(self): + """ + GIVEN tags with kubernetes.io/cluster/ tag + WHEN extract_cluster_name is called + THEN cluster name should be extracted + """ + tags_dict = {"kubernetes.io/cluster/test-eks-cluster": "owned"} + + cluster_name = extract_cluster_name(tags_dict) + + assert cluster_name == "test-eks-cluster" + + def test_extract_eks_cluster_name_from_aws_tag(self): + """ + GIVEN tags with aws:eks:cluster-name tag + WHEN extract_cluster_name is called + THEN cluster name should be extracted + """ + tags_dict = {"aws:eks:cluster-name": "my-eks-cluster"} + + cluster_name = extract_cluster_name(tags_dict) + + assert cluster_name == "my-eks-cluster" + + def test_extract_openshift_cluster_name(self): + """ + GIVEN tags with OpenShift kubernetes tag + WHEN extract_cluster_name is called + THEN infra ID should be extracted as cluster name + """ + tags_dict = {"kubernetes.io/cluster/openshift-infra-abc123": "owned"} + + cluster_name = extract_cluster_name(tags_dict) + + assert cluster_name == "openshift-infra-abc123" + + def test_no_cluster_name_returns_none(self): + """ + GIVEN tags without cluster identifiers + WHEN extract_cluster_name is called + THEN None should be returned + """ + tags_dict = {"Name": "standalone-instance", "iit-billing-tag": "test"} + + cluster_name = extract_cluster_name(tags_dict) + + assert cluster_name is None + + def test_kubernetes_tag_takes_precedence_over_aws_tag(self): + """ + GIVEN tags with both kubernetes.io and aws:eks tags + WHEN extract_cluster_name is called + THEN kubernetes.io cluster name should be returned + """ + tags_dict = { + "kubernetes.io/cluster/k8s-cluster": "owned", + "aws:eks:cluster-name": "eks-cluster", + } + + cluster_name = extract_cluster_name(tags_dict) + + # kubernetes.io tag is checked first in the function + assert cluster_name == "k8s-cluster" \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/test_policy_priority.py b/IaC/cdk/aws-resources-cleanup/tests/unit/test_policy_priority.py new file mode 100644 index 0000000000..15c63da069 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/test_policy_priority.py @@ -0,0 +1,130 @@ +"""Unit tests for policy evaluation priority and decision logic. + +Tests the priority ordering of cleanup policies. 
+""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.ec2.policies import ( + check_ttl_expiration, + check_stop_after_days, + check_long_stopped, + check_untagged, +) +from aws_resource_cleanup.ec2.instances import is_protected + + +@pytest.mark.unit +@pytest.mark.policies +class TestPolicyPriority: + """Test policy evaluation priority and decision logic.""" + + def test_ttl_policy_should_take_priority_over_untagged( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance with expired TTL but no billing tag + WHEN policies are evaluated in order + THEN TTL policy should create action before untagged policy + """ + instance = make_instance( + ttl_expired=True, + hours_old=2, + ttl_hours=1 + # No billing tag + ) + tags_dict = tags_dict_from_instance(instance) + + # TTL should be checked first + ttl_action = check_ttl_expiration(instance, tags_dict, current_time) + assert ttl_action is not None + assert ttl_action.action in ["TERMINATE", "TERMINATE_CLUSTER"] + + # Untagged would also match but shouldn't be reached + untagged_action = check_untagged(instance, tags_dict, current_time) + assert untagged_action is not None # Would also create action + + def test_stop_policy_checked_before_long_stopped( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN a running instance with stop-after-days policy + WHEN policies are evaluated + THEN stop-after-days should trigger before long-stopped + """ + # Instance running for 8 days + instance = make_instance( + state="running", + days_old=8, + stop_after_days=7, + billing_tag="pmm-staging" + ) + tags_dict = tags_dict_from_instance(instance) + + # Stop policy should match + stop_action = check_stop_after_days(instance, tags_dict, current_time) + assert stop_action is not None + assert stop_action.action == "STOP" + + # Long-stopped wouldn't match (instance is running) + long_stopped_action = check_long_stopped(instance, tags_dict, current_time) + assert long_stopped_action is None + + def test_protected_instance_skipped_by_all_policies( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN a protected instance (persistent billing tag) + WHEN is_protected is checked + THEN it should return True (handler would skip policy checks) + """ + # Protected instance with persistent tag + instance = make_instance( + name="protected-instance", + protected=True + ) + tags_dict = tags_dict_from_instance(instance) + + # Should be protected (is_protected returns tuple: (bool, str)) + is_protected_flag, reason = is_protected(tags_dict) + assert is_protected_flag is True + assert reason != "" # Should have a protection reason + + # In the handler, protected instances are skipped before policy checks + # So policies wouldn't even be evaluated + + def test_instance_with_multiple_matching_policies( + self, make_instance, current_time, tags_dict_from_instance + ): + """ + GIVEN an instance matching multiple policies (stopped, old, untagged) + WHEN policies are evaluated in order + THEN first matching policy should determine the action + """ + # Instance: stopped for 35 days, no billing tag + instance = make_instance( + state="stopped", + days_old=35 + # No billing tag + ) + tags_dict = tags_dict_from_instance(instance) + + # TTL doesn't apply (no TTL tags) + ttl_action = check_ttl_expiration(instance, tags_dict, current_time) + assert ttl_action is None + + # Stop policy doesn't apply (already stopped) + stop_action = check_stop_after_days(instance, tags_dict, 
current_time) + assert stop_action is None + + # Long-stopped applies (stopped > 30 days) + long_stopped_action = check_long_stopped(instance, tags_dict, current_time) + assert long_stopped_action is not None + assert long_stopped_action.action == "TERMINATE" + + # Untagged also applies but comes after long-stopped in priority + untagged_action = check_untagged(instance, tags_dict, current_time) + assert untagged_action is not None + + # In handler, long-stopped would be used (checked first in the or chain) \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/test_protection_logic.py b/IaC/cdk/aws-resources-cleanup/tests/unit/test_protection_logic.py new file mode 100644 index 0000000000..8dafa307b0 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/test_protection_logic.py @@ -0,0 +1,159 @@ +"""Unit tests for resource protection detection logic. + +Tests the is_protected() function and protection rules without AWS mocking. +""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.ec2.instances import is_protected + + +@pytest.mark.unit +@pytest.mark.policies +class TestBasicProtection: + """Test basic protection detection.""" + + def test_instance_with_persistent_billing_tag_is_protected( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN an instance with a persistent billing tag + WHEN is_protected is called + THEN True should be returned (instance is protected) + """ + instance = make_instance( + name="protected", + billing_tag="jenkins-dev-pmm" + ) + tags_dict = tags_dict_from_instance(instance) + + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is True + assert "jenkins-dev-pmm" in reason + + def test_instance_with_valid_billing_tag_is_protected( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN an instance with valid non-persistent billing tag + WHEN is_protected is called + THEN True should be returned (protected unless TTL overrides) + """ + instance = make_instance(billing_tag="pmm-staging") + tags_dict = tags_dict_from_instance(instance) + + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is True + assert "pmm-staging" in reason + + def test_untagged_instance_is_not_protected( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN an instance without any billing tag + WHEN is_protected is called + THEN False should be returned + """ + instance = make_instance(name="untagged") + tags_dict = tags_dict_from_instance(instance) + + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is False + assert reason == "" + + def test_instance_with_invalid_billing_tag_not_protected( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN an instance with expired timestamp billing tag + WHEN is_protected is called + THEN False should be returned + """ + # Expired timestamp (past date) + expired_timestamp = "1000000" # Very old timestamp + instance = make_instance(billing_tag=expired_timestamp) + tags_dict = tags_dict_from_instance(instance) + + # Note: is_protected uses has_valid_billing_tag which checks timestamps + # An expired timestamp should not protect the instance + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is False + + +@pytest.mark.unit +@pytest.mark.policies +class TestPersistentTags: + """Test all persistent billing tags are properly protected.""" + + @pytest.mark.parametrize( + 
"persistent_tag", + [ + "jenkins-cloud", + "jenkins-fb", + "jenkins-pg", + "jenkins-ps3", + "jenkins-ps57", + "jenkins-ps80", + "jenkins-psmdb", + "jenkins-pxb", + "jenkins-pxc", + "jenkins-rel", + "pmm-dev", + ], + ) + def test_all_persistent_tags_are_protected( + self, make_instance, tags_dict_from_instance, persistent_tag + ): + """ + GIVEN an instance with any persistent billing tag + WHEN is_protected is called + THEN True should be returned + """ + instance = make_instance(billing_tag=persistent_tag) + tags_dict = tags_dict_from_instance(instance) + + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is True + assert persistent_tag in reason + + +@pytest.mark.unit +@pytest.mark.policies +class TestProtectionOverrides: + """Test TTL and stop-after-days override protection.""" + + def test_instance_with_billing_tag_and_ttl_is_not_protected( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN an instance with billing tag but also TTL tags + WHEN is_protected is called + THEN False should be returned (TTL takes precedence) + """ + instance = make_instance( + billing_tag="pmm-staging", + ttl_expired=True, + ttl_hours=1 + ) + tags_dict = tags_dict_from_instance(instance) + + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is False + + def test_instance_with_billing_tag_and_stop_policy_not_protected( + self, make_instance, tags_dict_from_instance + ): + """ + GIVEN an instance with billing tag but also stop-after-days + WHEN is_protected is called + THEN False should be returned (stop policy takes precedence) + """ + instance = make_instance( + billing_tag="pmm-staging", + stop_after_days=7 + ) + tags_dict = tags_dict_from_instance(instance) + + is_protected_flag, reason = is_protected(tags_dict, "i-test123") + assert is_protected_flag is False \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/test_tag_conversion.py b/IaC/cdk/aws-resources-cleanup/tests/unit/test_tag_conversion.py new file mode 100644 index 0000000000..4536147e00 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/test_tag_conversion.py @@ -0,0 +1,52 @@ +"""Unit tests for tag format conversion utilities. + +Tests the convert_tags_to_dict() function. 
+""" + +from __future__ import annotations +import pytest + +from aws_resource_cleanup.utils import convert_tags_to_dict + + +@pytest.mark.unit +class TestTagConversion: + """Test AWS tag list to dictionary conversion.""" + + def test_convert_tags_list_to_dict(self): + """ + GIVEN AWS tags list format + WHEN convert_tags_to_dict is called + THEN dictionary should be returned + """ + tags_list = [ + {"Key": "Name", "Value": "test-instance"}, + {"Key": "iit-billing-tag", "Value": "pmm-staging"}, + {"Key": "owner", "Value": "test-user"}, + ] + + tags_dict = convert_tags_to_dict(tags_list) + + assert tags_dict == { + "Name": "test-instance", + "iit-billing-tag": "pmm-staging", + "owner": "test-user", + } + + def test_convert_empty_tags_list(self): + """ + GIVEN empty tags list + WHEN convert_tags_to_dict is called + THEN empty dictionary should be returned + """ + tags_dict = convert_tags_to_dict([]) + assert tags_dict == {} + + def test_convert_none_tags(self): + """ + GIVEN None as tags + WHEN convert_tags_to_dict is called + THEN empty dictionary should be returned + """ + tags_dict = convert_tags_to_dict(None) + assert tags_dict == {} \ No newline at end of file diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/volumes/test_volume_detection.py b/IaC/cdk/aws-resources-cleanup/tests/unit/volumes/test_volume_detection.py new file mode 100644 index 0000000000..6caff608e6 --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/volumes/test_volume_detection.py @@ -0,0 +1,270 @@ +"""Unit tests for EBS volume cleanup detection logic. + +Tests the check_unattached_volume() and is_volume_protected() functions. +""" + +from __future__ import annotations +import pytest +import datetime + +from aws_resource_cleanup.ec2.volumes import check_unattached_volume, is_volume_protected + + +@pytest.mark.unit +@pytest.mark.volumes +class TestVolumeDetection: + """Test volume cleanup detection logic.""" + + def test_available_volume_with_name_tag_creates_delete_action( + self, volume_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN an available (unattached) volume with Name tag + WHEN check_unattached_volume is called + THEN a DELETE_VOLUME action should be returned + """ + create_time = datetime.datetime.fromtimestamp( + current_time - 86400, # 1 day old + tz=datetime.timezone.utc + ) + volume = ( + volume_builder + .with_name("test-volume") + .with_state("available") + .with_create_time(create_time) + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + action = check_unattached_volume(volume, tags_dict, current_time) + + assert action is not None + assert action.action == "DELETE_VOLUME" + assert action.volume_id == "vol-test123456" + assert action.name == "test-volume" + assert action.billing_tag == "" + assert action.resource_type == "volume" + assert 0.9 < action.days_overdue < 1.1 # ~1 day old + + def test_in_use_volume_returns_none( + self, volume_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN a volume in 'in-use' state (attached) + WHEN check_unattached_volume is called + THEN None should be returned (not eligible for cleanup) + """ + volume = ( + volume_builder + .with_name("attached-volume") + .with_state("in-use") + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + action = check_unattached_volume(volume, tags_dict, current_time) + + assert action is None + + def test_volume_without_name_tag_creates_delete_action( + self, volume_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN an available volume without Name tag + 
WHEN check_unattached_volume is called + THEN a DELETE_VOLUME action should be returned (untagged volumes are targets) + """ + create_time = datetime.datetime.fromtimestamp( + current_time - 86400, # 1 day old + tz=datetime.timezone.utc + ) + volume = volume_builder.with_state("available").with_create_time(create_time).build() + # Manually remove Name tag if it was added + volume["Tags"] = [tag for tag in volume.get("Tags", []) if tag["Key"] != "Name"] + tags_dict = tags_dict_from_instance(volume) + + action = check_unattached_volume(volume, tags_dict, current_time) + + assert action is not None + assert action.action == "DELETE_VOLUME" + assert action.volume_id == "vol-test123456" + assert action.name == "" + assert action.billing_tag == "" + assert action.resource_type == "volume" + + +@pytest.mark.unit +@pytest.mark.volumes +class TestVolumeProtection: + """Test volume protection mechanisms.""" + + def test_volume_with_do_not_remove_in_name_is_protected( + self, volume_builder, tags_dict_from_instance + ): + """ + GIVEN a volume with 'do not remove' in Name tag + WHEN is_volume_protected is called + THEN True should be returned + """ + volume = volume_builder.with_name("jenkins-data, do not remove").build() + tags_dict = tags_dict_from_instance(volume) + + is_protected, reason = is_volume_protected(tags_dict, "vol-test123") + assert is_protected is True + assert "do not remove" in reason + + def test_volume_with_percona_keep_tag_is_protected( + self, volume_builder, tags_dict_from_instance + ): + """ + GIVEN a volume with PerconaKeep tag + WHEN is_volume_protected is called + THEN True should be returned + """ + volume = ( + volume_builder + .with_name("prod-volume") + .with_tag("PerconaKeep", "true") + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + is_protected, reason = is_volume_protected(tags_dict, "vol-test123") + assert is_protected is True + assert "PerconaKeep" in reason + + def test_volume_with_persistent_billing_tag_is_protected( + self, volume_builder, tags_dict_from_instance + ): + """ + GIVEN a volume with persistent billing tag (e.g., jenkins-*) + WHEN is_volume_protected is called + THEN True should be returned + """ + volume = ( + volume_builder + .with_name("jenkins-volume") + .with_billing_tag("jenkins-dev-pmm") + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + is_protected, reason = is_volume_protected(tags_dict, "vol-test123") + assert is_protected is True + assert "jenkins-dev-pmm" in reason + + def test_volume_with_valid_billing_tag_is_protected( + self, volume_builder, tags_dict_from_instance + ): + """ + GIVEN a volume with valid billing tag + WHEN is_volume_protected is called + THEN True should be returned + """ + volume = ( + volume_builder + .with_name("pmm-volume") + .with_billing_tag("pmm-staging") + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + is_protected, reason = is_volume_protected(tags_dict, "vol-test123") + assert is_protected is True + assert "pmm-staging" in reason + + def test_unprotected_volume_returns_false( + self, volume_builder, tags_dict_from_instance + ): + """ + GIVEN a volume without protection mechanisms + WHEN is_volume_protected is called + THEN False should be returned + """ + volume = ( + volume_builder + .with_name("unprotected-volume") + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + is_protected, reason = is_volume_protected(tags_dict, "vol-test123") + assert is_protected is False + assert reason == "" + + +@pytest.mark.unit +@pytest.mark.volumes +class 
TestVolumeDetectionIntegration: + """Test complete volume detection flow.""" + + def test_protected_available_volume_returns_none( + self, volume_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN an available volume that is protected + WHEN check_unattached_volume is called + THEN None should be returned (protected from deletion) + """ + volume = ( + volume_builder + .with_name("jenkins-data, do not remove") + .with_state("available") + .with_billing_tag("jenkins-ps80") + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + action = check_unattached_volume(volume, tags_dict, current_time) + + assert action is None + + def test_volume_age_calculation( + self, volume_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN a volume created 7 days ago + WHEN check_unattached_volume is called + THEN days_overdue should be approximately 7 + """ + create_time = datetime.datetime.fromtimestamp( + current_time - (7 * 86400), # 7 days ago + tz=datetime.timezone.utc + ) + volume = ( + volume_builder + .with_name("old-volume") + .with_state("available") + .with_create_time(create_time) + .build() + ) + tags_dict = tags_dict_from_instance(volume) + + action = check_unattached_volume(volume, tags_dict, current_time) + + assert action is not None + assert 6.9 < action.days_overdue < 7.1 + + def test_volume_cleanup_action_contains_volume_metadata( + self, volume_builder, current_time, tags_dict_from_instance + ): + """ + GIVEN an available volume with size and type metadata + WHEN check_unattached_volume is called + THEN the reason should include volume size and type + """ + volume = ( + volume_builder + .with_name("large-volume") + .with_state("available") + .with_size(500) # 500GB + .build() + ) + volume["VolumeType"] = "io2" # High-performance volume + tags_dict = tags_dict_from_instance(volume) + + action = check_unattached_volume(volume, tags_dict, current_time) + + assert action is not None + assert "500GB" in action.reason + assert "io2" in action.reason diff --git a/IaC/cdk/aws-resources-cleanup/tests/unit/volumes/test_volume_execution.py b/IaC/cdk/aws-resources-cleanup/tests/unit/volumes/test_volume_execution.py new file mode 100644 index 0000000000..ede6727fea --- /dev/null +++ b/IaC/cdk/aws-resources-cleanup/tests/unit/volumes/test_volume_execution.py @@ -0,0 +1,197 @@ +"""Unit tests for EBS volume deletion execution logic. + +Tests the delete_volume() function with mocked AWS calls. 
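+delete_volume is expected to re-check volume state and protection tags at execution
+time, so the cases below cover the DRY_RUN, in-use, already-deleted, protected, and
+ClientError paths.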
+""" + +from __future__ import annotations +import pytest +from unittest.mock import MagicMock, patch +from botocore.exceptions import ClientError + +from aws_resource_cleanup.ec2.volumes import delete_volume +from aws_resource_cleanup.models import CleanupAction + + +@pytest.fixture +def volume_action(): + """Fixture for a volume cleanup action.""" + return CleanupAction( + instance_id="", + region="us-east-2", + name="test-volume", + action="DELETE_VOLUME", + reason="Unattached volume (10GB gp3, created 2025-01-01, 5.2 days old)", + days_overdue=5.2, + billing_tag="test-billing", + resource_type="volume", + volume_id="vol-test123456" + ) + + +@pytest.mark.unit +@pytest.mark.volumes +class TestVolumeExecution: + """Test volume deletion execution.""" + + @patch("aws_resource_cleanup.ec2.volumes.DRY_RUN", False) + @patch("aws_resource_cleanup.ec2.volumes.boto3") + def test_successful_volume_deletion(self, mock_boto3, volume_action): + """ + GIVEN a valid volume deletion action + WHEN delete_volume is called in LIVE mode + THEN the volume should be deleted successfully + """ + # Mock EC2 client + mock_ec2 = MagicMock() + mock_boto3.client.return_value = mock_ec2 + + # Mock describe_volumes response + mock_ec2.describe_volumes.return_value = { + "Volumes": [{ + "VolumeId": "vol-test123456", + "State": "available", + "Tags": [{"Key": "Name", "Value": "test-volume"}] + }] + } + + result = delete_volume(volume_action, "us-east-2") + + assert result is True + mock_ec2.delete_volume.assert_called_once_with(VolumeId="vol-test123456") + + @patch("aws_resource_cleanup.ec2.volumes.DRY_RUN", True) + @patch("aws_resource_cleanup.ec2.volumes.boto3") + def test_dry_run_mode_logs_without_deleting(self, mock_boto3, volume_action): + """ + GIVEN a volume deletion action + WHEN delete_volume is called in DRY_RUN mode + THEN no actual deletion should occur + """ + mock_ec2 = MagicMock() + mock_boto3.client.return_value = mock_ec2 + + result = delete_volume(volume_action, "us-east-2") + + assert result is True + mock_ec2.delete_volume.assert_not_called() + + @patch("aws_resource_cleanup.ec2.volumes.DRY_RUN", False) + @patch("aws_resource_cleanup.ec2.volumes.boto3") + def test_volume_in_use_skips_deletion(self, mock_boto3, volume_action): + """ + GIVEN a volume that is now in-use + WHEN delete_volume is called + THEN deletion should be skipped with warning + """ + mock_ec2 = MagicMock() + mock_boto3.client.return_value = mock_ec2 + + # Mock volume changed to in-use state + mock_ec2.describe_volumes.return_value = { + "Volumes": [{ + "VolumeId": "vol-test123456", + "State": "in-use", # Changed state + "Tags": [{"Key": "Name", "Value": "test-volume"}] + }] + } + + result = delete_volume(volume_action, "us-east-2") + + assert result is False + mock_ec2.delete_volume.assert_not_called() + + @patch("aws_resource_cleanup.ec2.volumes.DRY_RUN", False) + @patch("aws_resource_cleanup.ec2.volumes.boto3") + def test_volume_not_found_returns_false(self, mock_boto3, volume_action): + """ + GIVEN a volume that no longer exists + WHEN delete_volume is called + THEN False should be returned (already deleted) + """ + mock_ec2 = MagicMock() + mock_boto3.client.return_value = mock_ec2 + + # Mock volume not found + mock_ec2.describe_volumes.return_value = {"Volumes": []} + + result = delete_volume(volume_action, "us-east-2") + + assert result is False + mock_ec2.delete_volume.assert_not_called() + + @patch("aws_resource_cleanup.ec2.volumes.DRY_RUN", False) + @patch("aws_resource_cleanup.ec2.volumes.boto3") + def 
test_volume_deletion_handles_client_error(self, mock_boto3, volume_action):
+        """
+        GIVEN a volume deletion that fails with ClientError
+        WHEN delete_volume is called
+        THEN False should be returned and the error logged
+        """
+        mock_ec2 = MagicMock()
+        mock_boto3.client.return_value = mock_ec2
+
+        # Mock successful describe but failed delete
+        mock_ec2.describe_volumes.return_value = {
+            "Volumes": [{
+                "VolumeId": "vol-test123456",
+                "State": "available",
+                "Tags": [{"Key": "Name", "Value": "test-volume"}]
+            }]
+        }
+
+        # Mock delete failure
+        error_response = {"Error": {"Code": "VolumeInUse", "Message": "Volume is in use"}}
+        mock_ec2.delete_volume.side_effect = ClientError(error_response, "DeleteVolume")
+
+        result = delete_volume(volume_action, "us-east-2")
+
+        assert result is False
+
+    @patch("aws_resource_cleanup.ec2.volumes.DRY_RUN", False)
+    @patch("aws_resource_cleanup.ec2.volumes.boto3")
+    def test_volume_protection_check_before_deletion(self, mock_boto3, volume_action):
+        """
+        GIVEN a volume that became protected after the action was created
+        WHEN delete_volume is called
+        THEN deletion should be skipped
+        """
+        mock_ec2 = MagicMock()
+        mock_boto3.client.return_value = mock_ec2
+
+        # Mock volume that now has protection
+        mock_ec2.describe_volumes.return_value = {
+            "Volumes": [{
+                "VolumeId": "vol-test123456",
+                "State": "available",
+                "Tags": [
+                    {"Key": "Name", "Value": "test-volume, do not remove"},  # Protected
+                    {"Key": "iit-billing-tag", "Value": "test-billing"}
+                ]
+            }]
+        }
+
+        result = delete_volume(volume_action, "us-east-2")
+
+        assert result is False
+        mock_ec2.delete_volume.assert_not_called()
+
+    def test_delete_volume_without_volume_id_returns_false(self):
+        """
+        GIVEN an action without volume_id
+        WHEN delete_volume is called
+        THEN False should be returned immediately
+        """
+        action_without_id = CleanupAction(
+            instance_id="",
+            region="us-east-2",
+            name="test",
+            action="DELETE_VOLUME",
+            reason="test",
+            days_overdue=1.0,
+            resource_type="volume",
+            volume_id=None  # Missing volume_id
+        )
+
+        result = delete_volume(action_without_id, "us-east-2")
+
+        assert result is False
diff --git a/IaC/justfile b/IaC/justfile
new file mode 100644
index 0000000000..c7245ef8c3
--- /dev/null
+++ b/IaC/justfile
@@ -0,0 +1,21 @@
+# Root IaC Justfile - Routes to CDK projects
+# Usage: just <project> <command>
+
+# Default - show available projects
+default:
+    @echo "Available CDK projects:"
+    @echo "  aws-resources-cleanup - Comprehensive AWS resource cleanup Lambda"
+    @echo ""
+    @echo "Usage from IaC/:"
+    @echo "  just aws-resources-cleanup            Show project help"
+    @echo "  just aws-resources-cleanup <command>  Run project command"
+    @echo ""
+    @echo "Common commands:"
+    @echo "  just aws-resources-cleanup deploy      Deploy in DRY_RUN mode"
+    @echo "  just aws-resources-cleanup logs        Tail CloudWatch logs"
+    @echo "  just aws-resources-cleanup invoke-aws  Test Lambda execution"
+    @echo "  just aws-resources-cleanup update-code Fast Lambda code update"
+
+# AWS Resources Cleanup
+aws-resources-cleanup *ARGS:
+    @cd cdk/aws-resources-cleanup && just {{ARGS}}
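+# Example: `just aws-resources-cleanup logs` runs `just logs` inside cdk/aws-resources-cleanup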