
feat(teleport-terraform): core use cases, EKS control plane, PR verification#27

Open
tenaciousdlg wants to merge 62 commits into `main` from `feat/teleport-terraform-core-use-cases`

Conversation

@tenaciousdlg (Contributor)

Summary

  • EKS control plane (5 layers): cluster, Teleport Helm install, RBAC, Slack plugin, Access Graph (Identity Security via RDS Aurora Serverless v2 + Helm)
  • Standalone control plane: single-node EC2 Teleport cluster with Route 53 + ACM
  • Data plane additions: application-access-demo-panel, database-access-cassandra-self-managed, kubernetes-access-eks-autodiscovery
  • Module consolidation: modules/self-database — single parameterized module replacing three separate DB modules, extended with Cassandra support
  • Shared RBAC module: modules/teleport-rbac used across all control-plane variants; join_sessions, autoupdate_config, enhanced recording consistent everywhere
  • Bug fixes: AL2023 AMI EBS minimum (all root volumes → 30GB), parent_domain → domain_name rename in standalone, BPF buffer sizes on SSH nodes, fixed terraform-provider role exclusion from engineers access list
  • PR verification: manual-steps runbook, terraform test for self-database db_type validation, fixed completeness check (was not excluding module subdirs), layer READMEs + outputs.tf for all control-plane numbered layers
  • Hygiene: terraform.tfvars.example added for all templates missing them, cd paths corrected to be relative to templates/teleport-terraform/, hardcoded dlg username replaced with engineer@example.com throughout
  • Removed: server-access-agentless-openssh — host cert automation requires exec provisioner, out of scope

Tested

The following templates and profiles were verified:

  • server-access-ssh-getting-started
  • application-access-grafana, httpbin, aws-console, demo-panel
  • database-access-postgres, mysql, mongodb, cassandra-self-managed
  • database-access-rds-mysql
  • desktop-access-windows-local
  • machine-id-ansible, machine-id-mcp
  • kubernetes-access-eks-autodiscovery
  • profiles/dev-demo, profiles/full-platform
  • control-plane/eks (layers 1–5)

Verification

```shell
# Static checks — no credentials needed (fmt, completeness, validate, terraform test)
cd templates/teleport-terraform
./tools/terraform-templates-check.sh
```

29 templates validate clean, 5/5 unit tests pass.

Manual steps outside Terraform

See templates/teleport-terraform/docs/manual-steps-runbook.md. Summary:

  • tsh login + eval $(tctl terraform env) each session
  • Okta SCIM credentials entered in Okta UI after tctl plugins install scim
  • Okta Push Groups configured manually (no Okta TF provider resource for this)
  • Slack bot invited to channel via /invite @<bot>
  • GitHub Actions CI bot: tctl bots add github-ci + join token (see docs/github-actions-setup.md)
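The per-session auth flow from the runbook can be sketched as below. The proxy address and username are illustrative, and these commands need a live Teleport cluster, so the sketch only prints them:

```shell
# Sketch of the per-session auth flow (proxy and user are illustrative).
session_setup='tsh login --proxy=teleport.example.com --user=engineer@example.com
eval "$(tctl terraform env)"   # exports provider credentials for this shell'
printf '%s\n' "$session_setup"
```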

tenaciousdlg and others added 30 commits January 15, 2026 23:46
- switch machine-id bot onboarding to bound_keypair with registration secret

- strengthen smoke-test verification for machine-id and label-aware checks

- make terraform-templates-check portable (no rg/mapfile) and ignore .terraform cache

- add CI workflow for template validation and refresh control-plane/docs updates
…ess request Slack plugin

New data-plane templates:
- kubernetes-access-eks-autodiscovery: EKS auto-discovery via tag-based enrollment
- server-access-ec2-autodiscovery: EC2 auto-discovery via SSM + IAM joining

New modules:
- kube-discovery-agent: EC2 agent running kubernetes_service + discovery_service
- ec2-discovery-agent: Discovery Service agent with dual-token pattern (secret + IAM join)

New profiles (multi-use-case compositions):
- windows-mongodb-ssh: SSH + MongoDB + Windows Desktop (traditional enterprise)
- cloud-native-apps: Grafana + HTTPBin + RDS MySQL + AWS Console (cloud-native)
- full-platform: all use cases combined

Control plane updates:
- eks/3-rbac: add prod-reviewer role; grant to engineers access list
- eks/4-plugins: new layer deploying teleport-plugin-slack via Helm for access request approvals

Security hardening across all existing templates and modules:
- Remove public IPs (associate_public_ip_address = false)
- Enforce IMDSv2 (http_tokens = required)
- Encrypt EBS root volumes
- Fix token expiry (8h TTL, lifecycle ignore_changes on metadata)
- Parameterize CIDRs
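A minimal HCL sketch of the hardening pattern above — resource and variable names are illustrative, and the provision-token attributes assume the Teleport Terraform provider's map-style schema, not the repo's actual modules:

```hcl
# Illustrative sketch -- names and attribute shapes are assumptions.
resource "aws_instance" "node" {
  ami           = var.ami_id # assumed variable
  instance_type = "t3.small"

  associate_public_ip_address = false # no public IPs

  metadata_options {
    http_tokens = "required" # enforce IMDSv2
  }

  root_block_device {
    encrypted = true # encrypt EBS root volume
  }
}

resource "teleport_provision_token" "node" {
  version = "v2"
  metadata = {
    name    = "node-join"
    expires = timeadd(timestamp(), "8h") # 8h TTL
  }
  spec = {
    roles = ["Node"]
  }

  lifecycle {
    ignore_changes = [metadata] # expiry timestamp would otherwise diff every plan
  }
}
```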

Tooling:
- tools/terraform-templates-check.sh: fmt + validate + optional plan + conftest
- tools/policy/: OPA/Conftest policies (IMDSv2, EBS, public IPs, Teleport labels, IAM wildcards)
- .pre-commit-config.yaml: fmt, validate, tflint, checkov hooks
- .github/workflows/teleport-demo-deploy.yml: one-click deploy via workflow_dispatch
- .github/workflows/teleport-demo-teardown.yml: scheduled and on-demand teardown

All 36 templates pass terraform fmt -check and terraform validate.
…ck files

.terraform.lock.hcl was added to the gitignore in the original codebase.
This commit restores that intent, removing the lock files that were
inadvertently committed in the previous commit.
…ed gitignore to ignore license files, feat(core); updated roles
…el app

- Replace self-postgres, self-mysql, self-mongodb with a single
  parameterized self-database module (db_type variable)
- Add Cassandra support to self-database with Java 11, PKCS12 TLS,
  AllowAllAuthenticator for mTLS-only access via Teleport
- Add data-plane example for Cassandra self-managed
- Add app-demo-panel module (Flask app behind Teleport App Access)
  that reads Teleport-Jwt-Assertion header and renders user identity;
  app code lives in a separate repo, deployed via git clone
- Add data-plane example and full-platform profile support for demo panel
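The module consolidation above can be sketched as a single parameterized call — only `modules/self-database` and `db_type` come from this PR; the other inputs are assumed:

```hcl
# Illustrative call to the consolidated module (inputs besides db_type assumed).
module "cassandra" {
  source = "../modules/self-database"

  db_type          = "cassandra" # also: postgres | mysql | mongodb
  proxy_address    = var.proxy_address
  teleport_version = var.teleport_version
  subnet_id        = module.network.subnet_id
}
```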

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug fixes:
- modules/desktop-service: security_groups → vpc_security_group_ids (custom VPC
  instances silently dropped the SG in Terraform AWS provider 5.x; broke outbound
  connectivity and prevented Teleport install)
- modules/mcp-stdio-app: docker package removed from AL2023 standard repos in
  2023.6+; switch to Docker CE via CentOS 9 repo with $releasever pinned to 9
  (AL2023 resolves $releasever as "2023" which has no packages in Docker's repo)

New profile:
- profiles/dev-demo: Bob (dev) + dlg (engineer) day-in-the-life demo — 2 dev SSH
  nodes, 1 prod SSH node (access-request gated), PostgreSQL, MongoDB, Grafana,
  HTTPBin, Windows Desktop, MCP, Ansible

READMEs (complete coverage — all data-plane, profiles, and modules now documented):
- New: 4 profile READMEs, 2 module READMEs (self-database, app-demo-panel),
  4 data-plane READMEs (demo-panel, cassandra, ec2-autodiscovery, eks-autodiscovery)
- Updated: root README (testing status table, corrected module list),
  profiles/README (dev-demo added), desktop-access (troubleshooting + fix note),
  aws-console (fresh-account vs shared-account IAM guidance)
- aws-console: added terraform.tfvars.example with manage_account_a_roles=true

Other:
- control-plane/eks/3-rbac/roles.tf: wildcard k8s labels, enable
  create_db_user/create_desktop_user
- data-plane/machine-id-mcp/outputs.tf: remove stale outputs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-panel

Sets the real published repo as the default so the template works
out of the box without requiring app_repo in tfvars.
…VAR, add Demo Points

All six templates now follow the same structure:
- Deploy section uses export TF_VAR_... only (no tfvars.example reference)
- Dedicated Access section with verified tsh commands
- Demo Points section with prose suitable for SE use
- Variables table

Templates updated: grafana, httpbin, mongodb-self-managed, mysql-self-managed,
postgres-self-managed, machine-id-mcp
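The standardized Deploy section boils down to supplying every input via `TF_VAR_*` environment variables, with no tfvars file. A sketch with illustrative values:

```shell
# Sketch of the TF_VAR_-only deploy flow (values are illustrative).
export TF_VAR_proxy_address="teleport.example.com"
export TF_VAR_env="dev"

# Terraform reads TF_VAR_* automatically:
#   terraform init && terraform apply
echo "proxy_address=${TF_VAR_proxy_address} env=${TF_VAR_env}"
```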
…table demo

Switches the MCP stdio server from the abstract test server (echo, add,
sampleLLM tools) to mcp/filesystem, which exposes real file I/O tools
(read_file, list_directory, etc.) against a demo directory on the host.

Adds /demo-files/ to userdata: README, config/app.yaml,
config/database.yaml, and logs/recent.log with realistic content that
gives Claude something meaningful to explore during the demo.

Updated across: data-plane, both profiles (dev-demo, full-platform),
module README, and smoke-test.
Adds outputs.tf with connection_guide to all data-plane templates that
lacked one, and improves existing outputs:

- application-access-{demo-panel,grafana,httpbin}: new outputs.tf
- database-access-{cassandra,mongodb,mysql,postgres}-self-managed: new outputs.tf
- database-access-rds-mysql: new outputs.tf, remove inline outputs from main.tf
- machine-id-{ansible,mcp}: new/updated outputs.tf with connection_guide
- server-access-ssh-getting-started: new outputs.tf
- application-access-aws-console: replace garbled jsonencode trust policy
  outputs with readable connection_guide + principal snippet
- desktop-access-windows-local: replace private IP outputs with
  web UI connection instructions

Also fixes two README accuracy bugs:
- mysql-self-managed: demo-mysql → mysql-<env> (actual resource name)
- postgres-self-managed: demo-postgres → postgres-<env>

machine-id-ansible README standardized to match other templates.
GitHub Actions:
- Deploy workflow: replace free-text use_cases with profile dropdown
  (dev-demo, full-platform, windows-mongodb-ssh, cloud-native-apps)
- Teardown workflow: loop over profiles/ instead of data-plane/
- State keys updated to teleport-demo/{env}/{profile}/terraform.tfstate
- full-platform/variables.tf: add defaults for demo_panel_app_repo and
  confirm console_role_arns default so teardown works without explicit input

RBAC (3-rbac/roles.tf):
- dev-auto-access + prod-auto-access: add db_users with email trait templates
  (was missing — auto user provisioning silently failed without it)
- platform-dev-access + prod-access: enable create_db_user_mode=keep and
  add trait-based db_users so IAM-auth RDS DBs work alongside self-hosted DBs

Profiles:
- dev-demo: add outputs.tf with connection_guide covering all 7 resource types
- dev-demo README: remove deprecated cp terraform.tfvars.example instruction
- All profile outputs.tf: add Teleport version to connection_guide header

Templates:
- README: expand GitHub Actions section (setup, secrets table, deploy/destroy)
- README: mark database-access-rds-mysql as tested
- rds-mysql outputs.tf: clarify db-user is your Teleport username (not reader/writer)
- terraform-templates-check.sh: add completeness pass (README.md + outputs.tf)
Userdata templates (all modules):
- Standardize set -euxo pipefail on all scripts
- Write token to /tmp/token file; reference by path in teleport.yaml
  instead of inline interpolation (avoids shell injection risk)
- Pin versioned enterprise install: bash -s "${teleport_version}" enterprise
- Normalize enabled: "yes"/"no" strings (no bare booleans)
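The token-by-path pattern above can be sketched as follows; paths are illustrative (a temp dir stands in for the real /tmp and /etc locations):

```shell
# Sketch of the token-by-path userdata pattern (paths illustrative).
set -euo pipefail
workdir="$(mktemp -d)"
join_token="example-join-token"

# 1. Write the join token to a file instead of interpolating it into YAML.
printf '%s' "$join_token" > "$workdir/token"

# 2. Reference the token by path in teleport.yaml -- no inline secret.
cat > "$workdir/teleport.yaml" <<EOF
teleport:
  auth_token: $workdir/token
EOF

grep "auth_token" "$workdir/teleport.yaml"
```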

ssh-node module:
- Add missing teleport_version variable and wire into templatefile call
- Update all callers: dev-demo, full-platform, windows-mongodb-ssh profiles
  and server-access-ssh-getting-started data-plane example

kube-discovery-agent module:
- Add team label to ssh_service so RBAC node label matching works
- Add teleport_version variable; wire into templatefile call
- Fix enabled: false → "no" in proxy/auth service blocks

ec2-discovery-agent module:
- Add NAT wait loop before install (mirrors kube-discovery-agent pattern)
- Wire teleport_version into templatefile call

3-rbac/roles.tf:
- Add host_groups = ["wheel"] to all roles with create_host_user_mode = "keep"
  (dev-access, dev-auto-access, platform-dev-access, prod-access, prod-auto-access)
  Fixes "host user creation not authorized" — host_groups is required for
  IsHostUserCreationAllowed() to return true

kubernetes-access-eks-autodiscovery:
- Add teleport_version variable to support versioned installs
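The 3-rbac host_groups fix can be sketched like this — the role name and labels are illustrative, and the spec shape assumes the provider's map-style role schema:

```hcl
# Illustrative sketch: host_groups must accompany create_host_user_mode = "keep"
# for IsHostUserCreationAllowed() to return true.
resource "teleport_role" "dev_access" {
  version = "v7"
  metadata = {
    name = "dev-access"
  }
  spec = {
    options = {
      create_host_user_mode = "keep"
    }
    allow = {
      logins      = ["{{internal.logins}}"]
      host_groups = ["wheel"] # without this, host user creation is denied
      node_labels = { env = ["dev"] }
    }
  }
}
```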
Agentless OpenSSH nodes require tctl auth sign to generate a host
certificate signed by the Teleport host CA. This cannot be automated
declaratively (no exec provisioners in this project), so the template
cannot self-bootstrap. Out of scope for now.
- docs/manual-steps-runbook.md: covers all manual steps outside Terraform
  (session auth, Okta SCIM credentials, Okta Push Groups, Slack bot invite,
  GitHub Actions CI bot)
- modules/self-database/tests/validation.tftest.hcl: terraform test for
  db_type validation — all four valid types pass, invalid type rejected
- tools/terraform-templates-check.sh: add terraform test step for modules
  with .tftest.hcl files; fix completeness check grep to correctly exclude
  modules/ subdirectories (was only excluding the parent /modules dir)
- control-plane/*/README.md: add brief layer READMEs for all numbered layers
  missing them (standalone, proxy-peer, cloud, eks 1-cluster through 3-rbac)
- control-plane/*/outputs.tf: add outputs.tf for layers that had outputs
  embedded in main.tf or had none; move embedded outputs to dedicated files
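The db_type unit test could look roughly like this `.tftest.hcl` sketch — run names and the invalid value are illustrative, and any other required module variables are omitted:

```hcl
# Sketch of modules/self-database/tests/validation.tftest.hcl (illustrative).
run "valid_postgres" {
  command = plan
  variables {
    db_type = "postgres"
  }
}

run "invalid_type_rejected" {
  command = plan
  variables {
    db_type = "oracle" # not one of the four supported types
  }
  expect_failures = [var.db_type]
}
```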
Comment thread .github/workflows/teleport-demo-deploy.yml — Fixed (2 threads)
Comment thread .github/workflows/teleport-demo-teardown.yml — Fixed (3 threads)
Comment thread .github/workflows/terraform-templates.yml — Fixed (2 threads)
Comment thread tools/README.md
Comment thread docs/github-actions-setup.md
socket-security Bot commented Mar 12, 2026


wondering if all the smoke tests and such should just be fresh GHAs that only trigger on changes in the TF framework? cc @webvictim


Yeah, I like the idea of the smoke tests having to run and succeed before changes can be committed here - just a question of how we get sufficient permissions for GHA to spawn everything and check it. I think it can be a follow-up PR or one for @jtarang :)


@kurktchiev kurktchiev left a comment


I am assuming this stack is driving the current Roadshow Demo env?

Prevents race condition where instances boot and run cloud-init before the
NAT gateway and route table associations are fully provisioned, causing
the Teleport install script curl to time out.
…liminate race condition

Add create_nat_gateway variable (default: true for backward compat). When false,
route table uses the internet gateway directly and subnet_id returns the public
subnet so instances get outbound internet access without a NAT gateway (~$32/mo savings).

All EC2 modules now use associate_public_ip_address = null so public subnet placement
automatically assigns a public IP via the subnet's map_public_ip_on_launch setting.

application-access-aws-console defaults create_nat_gateway = false since the app host
only needs outbound proxy connectivity and benefits most from the cost reduction.
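The optional-NAT wiring above can be sketched as follows (resource names are illustrative):

```hcl
# Illustrative sketch of the create_nat_gateway toggle.
variable "create_nat_gateway" {
  type    = bool
  default = true # backward compatible
}

resource "aws_eip" "nat" {
  count  = var.create_nat_gateway ? 1 : 0
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count         = var.create_nat_gateway ? 1 : 0
  allocation_id = aws_eip.nat[0].id
  subnet_id     = aws_subnet.public.id
}

# Instances defer to the subnet's map_public_ip_on_launch setting.
resource "aws_instance" "agent" {
  subnet_id                   = var.create_nat_gateway ? aws_subnet.private.id : aws_subnet.public.id
  associate_public_ip_address = null
  # ...
}
```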
Control plane logs cost ~$27/mo and are not needed for demo environments.
teleport-update enable (used internally by install.sh) accepts no positional args.
Passing "${teleport_version} enterprise" caused "unexpected 18.7.2" errors in 18.7.x.

The install.sh fetched from the proxy auto-installs the cluster-advertised version and
edition, so no args are needed. Fixes cloud-init failure on all EC2-based templates.
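The corrected install step reduces to piping the proxy-served script to bash with no positional arguments (proxy address illustrative):

```shell
# Sketch of the fixed install step: install.sh picks the cluster-advertised
# version and edition itself, so no args are passed.
install_cmd='curl -fsSL https://teleport.example.com/scripts/install.sh | bash'
printf '%s\n' "$install_cmd"
```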
Comment thread templates/teleport-terraform/README.md
├── 1-cluster/ # EKS infrastructure (stable, rarely changed)
├── 2-teleport/ # Teleport deployment + supporting AWS/K8s resources
├── 3-rbac/ # SAML/login rules, roles, and demo apps
└── update-teleport.sh

update-teleport.sh isn't here, and this listing omits the other two dirs — this could be updated.

Co-authored-by: Steven Martin <steven@goteleport.com>

@webvictim webvictim left a comment


Great job overall.

Lots of comments inline but overall:

  • a number of these data-plane files have values saying "Change this", but we don't call that out in the base README — if something needs to be changed for the deployment to work, it should be set up front
  • this is a big repo and most folks probably aren't going to read all of it. The main README should cover exactly how to do the most basic things.
  • I misunderstood the distinction between what was under data-plane and what was under modules at first (the PR review screen renders the main README right at the bottom, go figure)
    • In an ideal world I think maybe these parts should be separated so they're easier to follow - there's a lot of duplication between them at the moment and it isn't 100% clear when you'd want to use one vs the other.

#
# REQUIRED SECRETS (Settings → Secrets and variables → Actions):
# AWS_ROLE_ARN - IAM role ARN to assume via OIDC (recommended) or
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (fallback)

We should ~~heavily discourage putting~~ make it impossible to put secrets into GitHub Actions like this, IMO. It's just a recipe for accidental leakage at some point. The OIDC flow can teach people how to do this properly.

Comment on lines +76 to +77
"arn:aws:s3:::presales-teleport-demo-tfstate",
"arn:aws:s3:::presales-teleport-demo-tfstate/*"

I think these should be templated or otherwise "put-your-own-bucket-name-here", right? Are we assuming every SE will use their own bucket to prevent permissions stomping/issues?

Comment on lines +126 to +129
transition {
days = 90
storage_class = "GLACIER"
}

Is Glacier the best idea here given its slow retrieval times? I can imagine in the case of a compliance emergency it might be a problem. Maybe we're better off just leaving everything on standard storage for a year and then cleaning it up.

@@ -0,0 +1,11 @@
# 2-teleport/terraform.tfvars.example
# Copy to terraform.tfvars and fill in your values.

If we're suggesting that people use TF_VAR_* we should push that here too.

@@ -0,0 +1,19 @@
# 3-rbac/terraform.tfvars.example
# Copy to terraform.tfvars and fill in your values.

If we're suggesting that people use TF_VAR_* we should push that here too.

@@ -0,0 +1,6 @@
proxy_address = "teleport.example.com"

.tfvars or TF_VAR_*?

@@ -0,0 +1,13 @@
proxy_address = "teleport.example.com"

.tfvars or TF_VAR_*?


Profiles compose multiple data-plane use cases into a single Terraform root module for common prospect archetypes. Instead of deploying and managing individual templates, one `terraform apply` stands up an entire scenario.

**Key difference vs. data-plane templates:** Profiles share a single VPC across all use cases. Individual data-plane templates each create their own VPC (useful for isolation, one-feature demos). Profiles trade isolation for simplicity — one network, one state file, one `terraform destroy`.

This PR is telling me the answers as I read it.

Overall I think we ought to pick a lane. If the data plane use cases are single-VPC isolated setups then move them into a separate repo. These profile-type setups seem more useful for day-to-day demo stuff.

2. Bob SSHs to a dev node — Teleport creates a host user dynamically
3. Bob connects to postgres-dev via `tsh db connect` — no password
4. Bob submits an access request for prod access
5. $USER approves in Slack — prod-server appears in Bob's `tsh ls`

$USER or engineer or another human name?

