feat(teleport-terraform): core use cases, EKS control plane, PR verification#27
tenaciousdlg wants to merge 62 commits into main
…xamples and checks
- switch machine-id bot onboarding to bound_keypair with registration secret
- strengthen smoke-test verification for machine-id and label-aware checks
- make terraform-templates-check portable (no rg/mapfile) and ignore .terraform cache
- add CI workflow for template validation and refresh control-plane/docs updates
…ess request Slack plugin
New data-plane templates:
- kubernetes-access-eks-autodiscovery: EKS auto-discovery via tag-based enrollment
- server-access-ec2-autodiscovery: EC2 auto-discovery via SSM + IAM joining
New modules:
- kube-discovery-agent: EC2 agent running kubernetes_service + discovery_service
- ec2-discovery-agent: Discovery Service agent with dual-token pattern (secret + IAM join)
New profiles (multi-use-case compositions):
- windows-mongodb-ssh: SSH + MongoDB + Windows Desktop (traditional enterprise)
- cloud-native-apps: Grafana + HTTPBin + RDS MySQL + AWS Console (cloud-native)
- full-platform: all use cases combined
Control plane updates:
- eks/3-rbac: add prod-reviewer role; grant to engineers access list
- eks/4-plugins: new layer deploying teleport-plugin-slack via Helm for access request approvals
Security hardening across all existing templates and modules:
- Remove public IPs (associate_public_ip_address = false)
- Enforce IMDSv2 (http_tokens = required)
- Encrypt EBS root volumes
- Fix token expiry (8h TTL, lifecycle ignore_changes on metadata)
- Parameterize CIDRs
Tooling:
- tools/terraform-templates-check.sh: fmt + validate + optional plan + conftest
- tools/policy/: OPA/Conftest policies (IMDSv2, EBS, public IPs, Teleport labels, IAM wildcards)
- .pre-commit-config.yaml: fmt, validate, tflint, checkov hooks
- .github/workflows/teleport-demo-deploy.yml: one-click deploy via workflow_dispatch
- .github/workflows/teleport-demo-teardown.yml: scheduled and on-demand teardown
All 36 templates pass terraform fmt -check and terraform validate.
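The hardening items listed in this commit map onto a handful of `aws_instance` arguments. A minimal sketch, assuming a hypothetical module with `ami_id` and `subnet_id` variables (these names and the instance type are illustrative and do not come from the PR):

```hcl
variable "ami_id" {
  type        = string
  description = "AMI for the agent host (hypothetical variable name)"
}

variable "subnet_id" {
  type        = string
  description = "Private subnet for the agent host (hypothetical variable name)"
}

resource "aws_instance" "agent" {
  ami           = var.ami_id
  instance_type = "t3.small"
  subnet_id     = var.subnet_id

  # No public IP: the host only needs outbound connectivity.
  associate_public_ip_address = false

  # Enforce IMDSv2 by requiring session tokens for metadata requests.
  metadata_options {
    http_endpoint = "enabled"
    http_tokens   = "required"
  }

  # Encrypt the EBS root volume (uses the default EBS KMS key when no key is given).
  root_block_device {
    encrypted = true
  }
}
```

An instance without a public IP still needs an outbound path (for example a NAT gateway) to reach the Teleport proxy and package repositories.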
…ck files
.terraform.lock.hcl was added to the gitignore in the original codebase. This commit restores that intent, removing the lock files that were inadvertently committed in the previous commit.
…ed gitignore to ignore license files, feat(core); updated roles
…el app
- Replace self-postgres, self-mysql, self-mongodb with a single parameterized self-database module (db_type variable)
- Add Cassandra support to self-database with Java 11, PKCS12 TLS, AllowAllAuthenticator for mTLS-only access via Teleport
- Add data-plane example for Cassandra self-managed
- Add app-demo-panel module (Flask app behind Teleport App Access) that reads Teleport-Jwt-Assertion header and renders user identity; app code lives in a separate repo, deployed via git clone
- Add data-plane example and full-platform profile support for demo panel
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug fixes:
- modules/desktop-service: security_groups → vpc_security_group_ids (custom VPC instances silently dropped the SG in Terraform AWS provider 5.x; broke outbound connectivity and prevented Teleport install)
- modules/mcp-stdio-app: docker package removed from AL2023 standard repos in 2023.6+; switch to Docker CE via CentOS 9 repo with $releasever pinned to 9 (AL2023 resolves $releasever as "2023" which has no packages in Docker's repo)
New profile:
- profiles/dev-demo: Bob (dev) + dlg (engineer) day-in-the-life demo with 2 dev SSH nodes, 1 prod SSH node (access-request gated), PostgreSQL, MongoDB, Grafana, HTTPBin, Windows Desktop, MCP, Ansible
READMEs (complete coverage: all data-plane, profiles, and modules now documented):
- New: 4 profile READMEs, 2 module READMEs (self-database, app-demo-panel), 4 data-plane READMEs (demo-panel, cassandra, ec2-autodiscovery, eks-autodiscovery)
- Updated: root README (testing status table, corrected module list), profiles/README (dev-demo added), desktop-access (troubleshooting + fix note), aws-console (fresh-account vs shared-account IAM guidance)
- aws-console: added terraform.tfvars.example with manage_account_a_roles=true
Other:
- control-plane/eks/3-rbac/roles.tf: wildcard k8s labels, enable create_db_user/create_desktop_user
- data-plane/machine-id-mcp/outputs.tf: remove stale outputs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
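The desktop-service fix comes down to which argument carries the security group IDs. In the AWS provider, `security_groups` is intended for group names in EC2-Classic or a default VPC, while `vpc_security_group_ids` is the argument for instances launched into a VPC subnet. A sketch of the corrected form, with hypothetical variable and resource names:

```hcl
variable "ami_id" {
  type = string
}

variable "subnet_id" {
  type = string
}

variable "security_group_ids" {
  type        = list(string)
  description = "Security group IDs (sg-...) in the same VPC as subnet_id"
}

resource "aws_instance" "desktop_service" {
  ami           = var.ami_id
  instance_type = "t3.small"
  subnet_id     = var.subnet_id

  # Use vpc_security_group_ids for instances in a custom VPC subnet.
  # security_groups expects group names and is meant for the default VPC.
  vpc_security_group_ids = var.security_group_ids
}
```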
…-panel
Sets the real published repo as the default so the template works out of the box without requiring app_repo in tfvars.
…VAR, add Demo Points
All six templates now follow the same structure:
- Deploy section uses export TF_VAR_... only (no tfvars.example reference)
- Dedicated Access section with verified tsh commands
- Demo Points section with prose suitable for SE use
- Variables table
Templates updated: grafana, httpbin, mongodb-self-managed, mysql-self-managed, postgres-self-managed, machine-id-mcp
…table demo
Switches the MCP stdio server from the abstract test server (echo, add, sampleLLM tools) to mcp/filesystem, which exposes real file I/O tools (read_file, list_directory, etc.) against a demo directory on the host.
Adds /demo-files/ to userdata: README, config/app.yaml, config/database.yaml, and logs/recent.log with realistic content that gives Claude something meaningful to explore during the demo.
Updated across: data-plane, both profiles (dev-demo, full-platform), module README, and smoke-test.
Adds outputs.tf with connection_guide to all data-plane templates that
lacked one, and improves existing outputs:
- application-access-{demo-panel,grafana,httpbin}: new outputs.tf
- database-access-{cassandra,mongodb,mysql,postgres}-self-managed: new outputs.tf
- database-access-rds-mysql: new outputs.tf, remove inline outputs from main.tf
- machine-id-{ansible,mcp}: new/updated outputs.tf with connection_guide
- server-access-ssh-getting-started: new outputs.tf
- application-access-aws-console: replace garbled jsonencode trust policy
outputs with readable connection_guide + principal snippet
- desktop-access-windows-local: replace private IP outputs with
web UI connection instructions
Also fixes two README accuracy bugs:
- mysql-self-managed: demo-mysql → mysql-<env> (actual resource name)
- postgres-self-managed: demo-postgres → postgres-<env>
machine-id-ansible README standardized to match other templates.
GitHub Actions:
- Deploy workflow: replace free-text use_cases with profile dropdown
(dev-demo, full-platform, windows-mongodb-ssh, cloud-native-apps)
- Teardown workflow: loop over profiles/ instead of data-plane/
- State keys updated to teleport-demo/{env}/{profile}/terraform.tfstate
- full-platform/variables.tf: add defaults for demo_panel_app_repo and
confirm console_role_arns default so teardown works without explicit input
RBAC (3-rbac/roles.tf):
- dev-auto-access + prod-auto-access: add db_users with email trait templates
(was missing — auto user provisioning silently failed without it)
- platform-dev-access + prod-access: enable create_db_user_mode=keep and
add trait-based db_users so IAM-auth RDS DBs work alongside self-hosted DBs
Profiles:
- dev-demo: add outputs.tf with connection_guide covering all 7 resource types
- dev-demo README: remove deprecated cp terraform.tfvars.example instruction
- All profile outputs.tf: add Teleport version to connection_guide header
Templates:
- README: expand GitHub Actions section (setup, secrets table, deploy/destroy)
- README: mark database-access-rds-mysql as tested
- rds-mysql outputs.tf: clarify db-user is your Teleport username (not reader/writer)
- terraform-templates-check.sh: add completeness pass (README.md + outputs.tf)
Userdata templates (all modules):
- Standardize set -euxo pipefail on all scripts
- Write token to /tmp/token file; reference by path in teleport.yaml
instead of inline interpolation (avoids shell injection risk)
- Pin versioned enterprise install: bash -s "${teleport_version}" enterprise
- Normalize enabled: "yes"/"no" strings (no bare booleans)
ssh-node module:
- Add missing teleport_version variable and wire into templatefile call
- Update all callers: dev-demo, full-platform, windows-mongodb-ssh profiles
and server-access-ssh-getting-started data-plane example
kube-discovery-agent module:
- Add team label to ssh_service so RBAC node label matching works
- Add teleport_version variable; wire into templatefile call
- Fix enabled: false → "no" in proxy/auth service blocks
ec2-discovery-agent module:
- Add NAT wait loop before install (mirrors kube-discovery-agent pattern)
- Wire teleport_version into templatefile call
3-rbac/roles.tf:
- Add host_groups = ["wheel"] to all roles with create_host_user_mode = "keep"
(dev-access, dev-auto-access, platform-dev-access, prod-access, prod-auto-access)
Fixes "host user creation not authorized" — host_groups is required for
IsHostUserCreationAllowed() to return true
kubernetes-access-eks-autodiscovery:
- Add teleport_version variable to support versioned installs
Agentless OpenSSH nodes require tctl auth sign to generate a host certificate signed by the Teleport host CA. This cannot be automated declaratively (no exec provisioners in this project), so the template cannot self-bootstrap. Out of scope for now.
- docs/manual-steps-runbook.md: covers all manual steps outside Terraform (session auth, Okta SCIM credentials, Okta Push Groups, Slack bot invite, GitHub Actions CI bot)
- modules/self-database/tests/validation.tftest.hcl: terraform test for db_type validation; all four valid types pass, invalid type rejected
- tools/terraform-templates-check.sh: add terraform test step for modules with .tftest.hcl files; fix completeness check grep to correctly exclude modules/ subdirectories (was only excluding the parent /modules dir)
- control-plane/*/README.md: add brief layer READMEs for all numbered layers missing them (standalone, proxy-peer, cloud, eks 1-cluster through 3-rbac)
- control-plane/*/outputs.tf: add outputs.tf for layers that had outputs embedded in main.tf or had none; move embedded outputs to dedicated files
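For readers unfamiliar with `terraform test`, the db_type check described above could look roughly like the following. This is a sketch, not the module's actual code: the accepted strings (`postgres`, `mysql`, `mongodb`, `cassandra`) are assumed from the module names mentioned earlier, and it assumes `db_type` is the only variable the test needs to set. The two parts live in separate files, as the comments indicate.

```hcl
# --- variables.tf (in the module): restrict db_type to the supported engines ---
variable "db_type" {
  type        = string
  description = "Database engine to install"

  validation {
    condition     = contains(["postgres", "mysql", "mongodb", "cassandra"], var.db_type)
    error_message = "db_type must be one of: postgres, mysql, mongodb, cassandra."
  }
}

# --- tests/validation.tftest.hcl: plan-only runs, nothing is applied ---
run "accepts_cassandra" {
  command = plan

  variables {
    db_type = "cassandra"
  }
}

run "rejects_unknown_type" {
  command = plan

  variables {
    db_type = "oracle"
  }

  # This run passes only if the db_type validation fails.
  expect_failures = [var.db_type]
}
```

Running `terraform test` from the module directory picks up `.tftest.hcl` files under `tests/` by default.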
…re tools README, add workflows README
wondering if all the smoke tests and such should just be fresh GHAs that only trigger on changes in the TF framework? cc @webvictim
Yeah, I like the idea of the smoke tests having to run and succeed before changes can be committed here - just a question of how we get sufficient permissions for GHA to spawn everything and check it. I think it can be a follow-up PR or one for @jtarang :)
kurktchiev left a comment
i am assuming this stack is driving current Roadshow Demo env?
Prevents race condition where instances boot and run cloud-init before the NAT gateway and route table associations are fully provisioned, causing the Teleport install script curl to time out.
…liminate race condition
Add create_nat_gateway variable (default: true for backward compat). When false, route table uses the internet gateway directly and subnet_id returns the public subnet so instances get outbound internet access without a NAT gateway (~$32/mo savings).
All EC2 modules now use associate_public_ip_address = null so public subnet placement automatically assigns a public IP via the subnet's map_public_ip_on_launch setting.
application-access-aws-console defaults create_nat_gateway = false since the app host only needs outbound proxy connectivity and benefits most from the cost reduction.
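The toggle described in this commit can be arranged in more than one way. The sketch below is one possible layout, not the module's actual code: every resource name, the CIDR ranges, and the `ami_id` variable are hypothetical. It keeps the public subnet always routed through the internet gateway and creates the private subnet and NAT gateway only when `create_nat_gateway` is true.

```hcl
variable "create_nat_gateway" {
  type        = bool
  default     = true
  description = "When false, skip the NAT gateway and place instances in the public subnet"
}

variable "ami_id" {
  type        = string
  description = "AMI for the demo instance (hypothetical variable name)"
}

resource "aws_vpc" "this" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
}

# The public subnet always routes through the internet gateway.
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.this.id
  cidr_block              = "10.0.0.0/24"
  map_public_ip_on_launch = true
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.this.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# Private subnet, EIP, NAT gateway, and private routing exist only with the toggle on.
resource "aws_subnet" "private" {
  count      = var.create_nat_gateway ? 1 : 0
  vpc_id     = aws_vpc.this.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_eip" "nat" {
  count  = var.create_nat_gateway ? 1 : 0
  domain = "vpc" # AWS provider 5.x syntax
}

resource "aws_nat_gateway" "this" {
  count         = var.create_nat_gateway ? 1 : 0
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public.id
  depends_on    = [aws_internet_gateway.this]
}

resource "aws_route_table" "private" {
  count  = var.create_nat_gateway ? 1 : 0
  vpc_id = aws_vpc.this.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = var.create_nat_gateway ? 1 : 0
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

locals {
  # one() returns null for an empty list, so this is safe when count is 0.
  instance_subnet_id = var.create_nat_gateway ? one(aws_subnet.private[*].id) : aws_subnet.public.id
}

resource "aws_instance" "demo" {
  ami           = var.ami_id
  instance_type = "t3.small"
  subnet_id     = local.instance_subnet_id

  # null defers to the subnet's map_public_ip_on_launch setting.
  associate_public_ip_address = null

  # Wait for routing before boot so cloud-init has outbound access.
  depends_on = [
    aws_route_table_association.public,
    aws_route_table_association.private,
  ]
}
```

The explicit `depends_on` on the instance reflects the boot-time race mentioned in the preceding commit: without it, Terraform may launch the instance before the route table associations exist.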
Control plane logs cost ~$27/mo and are not needed for demo environments.
teleport-update enable (used internally by install.sh) accepts no positional args.
Passing "${teleport_version} enterprise" caused "unexpected 18.7.2" errors in 18.7.x.
The install.sh fetched from the proxy auto-installs the cluster-advertised version and
edition, so no args are needed. Fixes cloud-init failure on all EC2-based templates.
| ├── 1-cluster/ # EKS infrastructure (stable, rarely changed) | ||
| ├── 2-teleport/ # Teleport deployment + supporting AWS/K8s resources | ||
| ├── 3-rbac/ # SAML/login rules, roles, and demo apps | ||
| └── update-teleport.sh |
The update-teleport.sh isn't here and this doesn't include the other two dirs when this can be updated.
Co-authored-by: Steven Martin <steven@goteleport.com>
webvictim left a comment
Great job overall.
Lots of comments inline but overall:
- a number of these data-plane files contain values marked "Change this", but we don't call that out in the base README - if something needs to be changed for the deployment to work, it should be set up front
- this is a big repo and most folks probably aren't going to read all of it. The main README should cover exactly how to do the most basic things.
- I misunderstood the distinction between what was under `data-plane` and what was under `modules` at first (the PR review screen renders the main README right at the bottom, go figure)
  - In an ideal world I think maybe these parts should be separated so they're easier to follow - there's a lot of duplication between them at the moment and it isn't 100% clear when you'd want to use one vs the other.
| # | ||
| # REQUIRED SECRETS (Settings → Secrets and variables → Actions): | ||
| # AWS_ROLE_ARN - IAM role ARN to assume via OIDC (recommended) or | ||
| # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (fallback) |
We should heavily discourage putting secrets into GitHub Actions like this, or better, make it impossible, IMO. It's just a recipe for accidental leakage at some point. The OIDC flow can teach people how to do this properly.
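For reference, the OIDC approach replaces stored access keys with an IAM role that trusts GitHub's token issuer. A minimal sketch, assuming the GitHub OIDC provider already exists in the AWS account; the `github_repo` variable and the role name are hypothetical:

```hcl
variable "github_repo" {
  type        = string
  description = "Repository allowed to assume the role, in owner/name form (hypothetical variable)"
}

# Assumes an IAM OIDC provider for GitHub Actions is already registered in this account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    # Only tokens minted for AWS STS are accepted.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows from the named repository may assume the role.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${var.github_repo}:*"]
    }
  }
}

resource "aws_iam_role" "github_actions" {
  name               = "teleport-demo-github-actions"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}

output "aws_role_arn" {
  value = aws_iam_role.github_actions.arn
}
```

With this in place, the only value the workflow needs is the role ARN, which is an identifier rather than a credential; the workflow itself must request `id-token: write` permission.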
| "arn:aws:s3:::presales-teleport-demo-tfstate", | ||
| "arn:aws:s3:::presales-teleport-demo-tfstate/*" |
I think these should be templated or otherwise "put-your-own-bucket-name-here", right? Are we assuming every SE will use their own bucket to prevent permissions stomping/issues?
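One way to template the bucket name is to pass it in as a variable and build the ARNs from it. A sketch under that assumption; the variable name and the exact set of S3 actions are illustrative and not taken from the PR:

```hcl
variable "state_bucket_name" {
  type        = string
  description = "S3 bucket holding Terraform state; each user supplies their own (hypothetical variable)"
}

data "aws_iam_policy_document" "tfstate" {
  # Listing applies to the bucket itself.
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${var.state_bucket_name}"]
  }

  # Object reads and writes apply to keys inside the bucket.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::${var.state_bucket_name}/*"]
  }
}

output "tfstate_policy_json" {
  value = data.aws_iam_policy_document.tfstate.json
}
```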
| transition { | ||
| days = 90 | ||
| storage_class = "GLACIER" | ||
| } |
Is Glacier the best idea here given its slow retrieval times? I can imagine in the case of a compliance emergency it might be a problem. Maybe we're better off just leaving everything on standard storage for a year and then cleaning it up.
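The alternative suggested here (keep objects in standard storage for a year, then delete them) could be expressed as a single expiration rule with no transition. A sketch, assuming the bucket is passed in by ID; which bucket the original rule applies to is not shown in this excerpt, and the variable and rule names are hypothetical:

```hcl
variable "bucket_id" {
  type        = string
  description = "ID of the bucket the lifecycle rule applies to (hypothetical variable)"
}

resource "aws_s3_bucket_lifecycle_configuration" "retention" {
  bucket = var.bucket_id

  rule {
    id     = "expire-after-one-year"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    filter {}

    # Objects stay in the default storage class until they are deleted.
    expiration {
      days = 365
    }
  }
}
```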
| @@ -0,0 +1,11 @@ | |||
| # 2-teleport/terraform.tfvars.example | |||
| # Copy to terraform.tfvars and fill in your values. | |||
If we're suggesting that people use TF_VAR_* we should push that here too.
| @@ -0,0 +1,19 @@ | |||
| # 3-rbac/terraform.tfvars.example | |||
| # Copy to terraform.tfvars and fill in your values. | |||
If we're suggesting that people use TF_VAR_* we should push that here too.
| @@ -0,0 +1,6 @@ | |||
| proxy_address = "teleport.example.com" | |||
| @@ -0,0 +1,13 @@ | |||
| proxy_address = "teleport.example.com" | |||
| Profiles compose multiple data-plane use cases into a single Terraform root module for common prospect archetypes. Instead of deploying and managing individual templates, one `terraform apply` stands up an entire scenario. | ||
| **Key difference vs. data-plane templates:** Profiles share a single VPC across all use cases. Individual data-plane templates each create their own VPC (useful for isolation, one-feature demos). Profiles trade isolation for simplicity — one network, one state file, one `terraform destroy`. |
This PR is telling me the answers as I read it.
Overall I think we ought to pick a lane. If the data plane use cases are single-VPC isolated setups then move them into a separate repo. These profile-type setups seem more useful for day-to-day demo stuff.
| 2. Bob SSHs to a dev node — Teleport creates a host user dynamically | ||
| 3. Bob connects to postgres-dev via `tsh db connect` — no password | ||
| 4. Bob submits an access request for prod access | ||
| 5. $USER approves in Slack — prod-server appears in Bob's `tsh ls` |
$USER or engineer or another human name?
Summary
- `application-access-demo-panel`, `database-access-cassandra-self-managed`, `kubernetes-access-eks-autodiscovery`
- `modules/self-database`: single parameterized module replacing three separate DB modules, extended with Cassandra support
- `modules/teleport-rbac` used across all control-plane variants; join_sessions, autoupdate_config, enhanced recording consistent everywhere
- `parent_domain` → `domain_name` rename in standalone, BPF buffer sizes on SSH nodes, fixed `terraform-provider` role exclusion from engineers access list
- `terraform test` for `self-database` db_type validation, fixed completeness check (was not excluding module subdirs), layer READMEs + `outputs.tf` for all control-plane numbered layers
- `terraform.tfvars.example` added for all templates missing them, `cd` paths corrected to be relative to `templates/teleport-terraform/`, hardcoded `dlg` username replaced with `engineer@example.com` throughout
- `server-access-agentless-openssh`: host cert automation requires exec provisioner, out of scope
Tested
- `server-access-ssh-getting-started`
- `application-access-grafana`, `httpbin`, `aws-console`, `demo-panel`
- `database-access-postgres`, `mysql`, `mongodb`, `cassandra-self-managed`
- `database-access-rds-mysql`
- `desktop-access-windows-local`
- `machine-id-ansible`, `machine-id-mcp`
- `kubernetes-access-eks-autodiscovery`
- `profiles/dev-demo`, `profiles/full-platform`
- `control-plane/eks` (layers 1–5)
Verification
29 templates validate clean, 5/5 unit tests pass.
Manual steps outside Terraform
See `templates/teleport-terraform/docs/manual-steps-runbook.md`. Summary:
- `tsh login` + `eval $(tctl terraform env)` each session
- `tctl plugins install scim`
- `/invite @<bot>`
- `tctl bots add github-ci` + join token (see `docs/github-actions-setup.md`)