This project currently contains a working OTP flow for a Go backend service. The implementation has been built incrementally with small, reviewable slices using ChatGPT for architecture/review and Codex for focused implementation.
The OTP subsystem currently supports:
- OTP send
- OTP verify
- Redis-backed OTP state
- tenant settings lookup with Redis cache + PostgreSQL fallback
- fake SMS provider
- dev-only fake SMS OTP code capture
- PostgreSQL request logging
- PostgreSQL verification logging
- resend protection while an OTP is active
- separated OTP validity TTL and resend cooldown
- Redis/Lua atomic OTP reservation for first send
- Redis/Lua atomic OTP replacement for eligible resend
- request-ID-conditional verify mutations
- resend cooldown Retry-After
- per tenant + phone OTP send rate limiting
- Redis fixed-window send limiter
- Redis token-bucket send limiter
- Redis mixed send limiter for tenant=token_bucket and phone=fixed_window
- low-cardinality OTP send outcome metrics
- refined rate-limit metric reasons for phone, tenant, and both dimensions
- HTTP handlers and routes
- runtime wiring in cmd/server/main.go
- env-driven OTP configuration
- focused unit and integration-style tests
Current high-level flow:
HTTP API
-> internal/api handlers
-> internal/otp.Service
-> ports/interfaces
-> Redis/PostgreSQL/SMS adapters
Main components:
internal/api
Thin HTTP handlers and route registration.
internal/otp
Domain/application service for SendOTP and VerifyOTP.
Owns core OTP orchestration, validation, hashing, retry/attempt rules, and domain errors.
internal/repository
PostgreSQL and Redis adapters:
- tenant settings repository
- cached tenant settings provider
- Redis OTP store
- Redis send rate limiter
- OTP request log repository
- OTP verification log repository
internal/sms
Fake SMS provider used for local/dev and simulated provider behavior.
internal/config
Env-driven runtime configuration for OTP, fake SMS, and send rate limiting.
cmd/server/main.go
Runtime wiring for database, Redis, repositories, OTP service, SMS provider, and routes.
Implemented:
- OTP domain request/response models
- tenant settings model
- OTP state model
- SMS request/result models
- request/provider/verification log models
- domain errors
- interfaces/ports
- OTP config defaults
- OTP hashing helpers
- configurable numeric OTP generation
Implemented:
- dynamic numeric OTP generation
- backward-compatible 6-digit generator
- SHA-256 based hash helper
- constant-time code verification helper
- no plaintext OTP persistence in the main OTP state
Important behavior:
Redis OTP state stores code_hash only.
Plaintext OTP is not stored in the main OTP key.
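The generation, hashing, and constant-time comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual helpers: the function names `GenerateNumericCode`, `HashCode`, and `VerifyCode` are hypothetical, and hashing the bare code (with no salt or extra inputs) is an assumption.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"math/big"
)

// GenerateNumericCode returns a random decimal code with the given number of digits.
// Hypothetical helper name; the real generator may differ.
func GenerateNumericCode(length int) (string, error) {
	if length <= 0 {
		return "", errors.New("code length must be positive")
	}
	ten := big.NewInt(10)
	digits := make([]byte, length)
	for i := range digits {
		n, err := rand.Int(rand.Reader, ten) // uniform in [0, 10)
		if err != nil {
			return "", err
		}
		digits[i] = byte('0' + n.Int64())
	}
	return string(digits), nil
}

// HashCode returns the hex-encoded SHA-256 digest of the code.
// Assumption: the real helper may mix in additional inputs.
func HashCode(code string) string {
	sum := sha256.Sum256([]byte(code))
	return hex.EncodeToString(sum[:])
}

// VerifyCode hashes the submitted code and compares it with the stored
// hash in constant time.
func VerifyCode(submitted, storedHash string) bool {
	computed := HashCode(submitted)
	return subtle.ConstantTimeCompare([]byte(computed), []byte(storedHash)) == 1
}

func main() {
	code, err := GenerateNumericCode(6)
	if err != nil {
		panic(err)
	}
	stored := HashCode(code) // only this value would be written to code_hash
	fmt.Println(len(code), VerifyCode(code, stored), VerifyCode("not-the-code", stored))
}
```

Only the digest would be persisted in the `code_hash` field; the plaintext code exists only long enough to be handed to the SMS provider.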
Implemented:
- Redis-backed OTP state store
- key format: otp:{tenant_id}:{phone}
- Redis Hash storage
- Save/Get/Delete
- atomic CreateIfAbsent using Redis Lua
- atomic ReplaceIfRequestID using Redis Lua
- request-ID-conditional DeleteIfRequestID using Redis Lua
- request-ID-conditional IncrementAttemptsIfRequestID using Redis Lua
- atomic IncrementAttempts using Redis Lua for compatibility paths
- TTL-based expiration
- resend cooldown metadata based on Redis server time
- malformed state detection
- integration-style Redis tests
Stored fields:
request_id
tenant_id
phone
code_hash
attempt_count
max_attempts
created_at
expires_at
resend_available_at_ms
Important behavior:
First send creates OTP state only if no active state exists.
Eligible resend replaces OTP state only if the observed request_id is still current.
Verify mutations only affect the state if request_id still matches.
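The request-ID-conditional semantics above can be modelled in memory as follows. This is a sketch under stated assumptions: the real store runs these checks inside Redis Lua scripts and applies TTLs, both of which are omitted here, and the `MemStore` and `State` types are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a reduced, hypothetical version of the stored OTP state.
type State struct {
	RequestID string
	CodeHash  string
	Attempts  int
}

// MemStore stands in for Redis; the mutex plays the role of Lua atomicity.
type MemStore struct {
	mu     sync.Mutex
	states map[string]State
}

func NewMemStore() *MemStore { return &MemStore{states: make(map[string]State)} }

// Key follows the documented format otp:{tenant_id}:{phone}.
func Key(tenantID, phone string) string { return "otp:" + tenantID + ":" + phone }

// CreateIfAbsent stores the state only when no state exists for the key.
func (s *MemStore) CreateIfAbsent(key string, st State) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.states[key]; exists {
		return false
	}
	s.states[key] = st
	return true
}

// ReplaceIfRequestID swaps in the new state only if the observed request ID is still current.
func (s *MemStore) ReplaceIfRequestID(key, observedID string, next State) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.states[key]
	if !exists || cur.RequestID != observedID {
		return false
	}
	s.states[key] = next
	return true
}

// IncrementAttemptsIfRequestID bumps the attempt counter only for the matching request ID.
func (s *MemStore) IncrementAttemptsIfRequestID(key, requestID string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.states[key]
	if !exists || cur.RequestID != requestID {
		return 0, false
	}
	cur.Attempts++
	s.states[key] = cur
	return cur.Attempts, true
}

// DeleteIfRequestID removes the state only for the matching request ID.
func (s *MemStore) DeleteIfRequestID(key, requestID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.states[key]
	if !exists || cur.RequestID != requestID {
		return false
	}
	delete(s.states, key)
	return true
}

func main() {
	store := NewMemStore()
	key := Key("t1", "+15555550100")
	fmt.Println(store.CreateIfAbsent(key, State{RequestID: "req-1"}))              // first send
	fmt.Println(store.ReplaceIfRequestID(key, "req-1", State{RequestID: "req-2"})) // eligible resend
	fmt.Println(store.DeleteIfRequestID(key, "req-1"))                             // stale verify path
}
```

In this model the stale delete is rejected because the resend already replaced `req-1` with `req-2`, which is the property that keeps an old verify path from removing a newer OTP.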
Implemented:
- Redis cache-aside provider for tenant settings
- PostgreSQL fallback
- Redis key format: tenant:{tenant_id}:settings
- stores only OTP-domain tenant settings subset
- avoids caching sensitive/unneeded DB fields
- falls back on malformed cache
- source errors returned when PostgreSQL lookup fails
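The cache-aside behavior listed above might look roughly like this. The sketch assumes a JSON-encoded cache value and uses a map and a function in place of Redis and PostgreSQL; the `TenantSettings` fields and the `Provider` type are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrTenantNotFound stands in for a source-level lookup error.
var ErrTenantNotFound = errors.New("tenant not found")

// TenantSettings is a hypothetical OTP-domain subset of the tenant row.
type TenantSettings struct {
	TenantID string `json:"tenant_id"`
	Enabled  bool   `json:"enabled"`
}

// Provider reads from the cache first and falls back to the source.
type Provider struct {
	Cache  map[string]string                             // stands in for Redis
	Source func(tenantID string) (TenantSettings, error) // stands in for PostgreSQL
}

func (p *Provider) Get(tenantID string) (TenantSettings, error) {
	key := "tenant:" + tenantID + ":settings"
	if raw, hit := p.Cache[key]; hit {
		var cached TenantSettings
		if err := json.Unmarshal([]byte(raw), &cached); err == nil && cached.TenantID == tenantID {
			return cached, nil
		}
		// Malformed cache entry: ignore it and fall back to the source.
	}
	fresh, err := p.Source(tenantID)
	if err != nil {
		return TenantSettings{}, err // source errors are returned to the caller
	}
	if raw, err := json.Marshal(fresh); err == nil {
		p.Cache[key] = string(raw) // a real cache write would also set a TTL
	}
	return fresh, nil
}

func main() {
	p := &Provider{
		Cache: map[string]string{},
		Source: func(id string) (TenantSettings, error) {
			if id == "missing" {
				return TenantSettings{}, ErrTenantNotFound
			}
			return TenantSettings{TenantID: id, Enabled: true}, nil
		},
	}
	s, err := p.Get("t1")
	fmt.Println(s, err)
	_, err = p.Get("missing")
	fmt.Println(err)
}
```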
Implemented:
- fake SMS provider implementing OTP SMS provider interface
- configurable latency
- default delay: 20ms to 30ms
- context cancellation/timeout support
- safe SMS result
- no OTP code in RawResponse
- dev-only Redis debug code capture
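A fake provider with configurable latency and context support could be sketched like this. The `FakeProvider` type, its `Send` signature, and the `SendWithTimeout` helper are hypothetical and are not the project's actual SMS interface.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// FakeProvider simulates provider latency between MinDelay and MaxDelay.
type FakeProvider struct {
	MinDelay time.Duration
	MaxDelay time.Duration
}

// Send waits for a random delay and returns early if the context ends first.
// The phone and message arguments are accepted but never logged or stored.
func (p FakeProvider) Send(ctx context.Context, phone, message string) error {
	delay := p.MinDelay
	if p.MaxDelay > p.MinDelay {
		delay += time.Duration(rand.Int63n(int64(p.MaxDelay-p.MinDelay) + 1))
	}
	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// SendWithTimeout applies a provider timeout, similar in spirit to OTP_PROVIDER_TIMEOUT.
func SendWithTimeout(p FakeProvider, timeout time.Duration, phone, message string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return p.Send(ctx, phone, message)
}

func main() {
	p := FakeProvider{MinDelay: 20 * time.Millisecond, MaxDelay: 30 * time.Millisecond}
	fmt.Println(SendWithTimeout(p, 2*time.Second, "+15555550100", "your code"))
	fmt.Println(SendWithTimeout(p, time.Millisecond, "+15555550100", "your code"))
}
```

With the default 20ms to 30ms delay, a 2s timeout lets the send complete, while a timeout shorter than the minimum delay surfaces the context error to the caller.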
Implemented for local/manual testing only.
Behavior:
- disabled by default
- enabled through config/env
- only active outside release mode
- stores plaintext OTP in a separate Redis debug key
- does not expose code in API response
- does not write code into otp_requests
- does not write code into normal OTP state
- does not log the code
Debug key format:
debug:otp-code:{tenant_id}:{phone}
Implemented behavior:
- validate request
- normalize phone
- load tenant settings
- validate tenant
- load existing OTP state
- block resend if active state exists and resend cooldown has not elapsed
- check optional send rate limiter
- generate request ID
- generate OTP code
- hash OTP code
- create OTP request log
- create OTP state atomically with CreateIfAbsent for first send
- replace OTP state atomically with ReplaceIfRequestID for eligible resend
- send SMS through provider only after successful Redis reservation/replacement
- update provider result log
- return request ID and expiration
Important details:
- phone normalization happens before Redis keys, logs, limiter identity, and SMS send
- tenant validation happens before Redis OTP state check
- resend cooldown check happens before send rate limiting
- eligible resend still passes through send rate limiting
- blocked cooldown resend does not create request log
- rate-limited send does not create request log
- SMS is sent only after atomic OTP state reservation or replacement succeeds
- SMS provider failure is mapped to domain provider failure
- request logging is mandatory for send lifecycle
- request ID changes on every successful resend
- successful resend invalidates the previous OTP code
Implemented behavior:
- validate request
- normalize phone
- load OTP state from Redis
- handle not found
- handle expiration
- handle max attempts already reached
- verify code hash
- increment attempts only for wrong code using request-ID-conditional mutation
- handle invalid code
- handle max attempts after increment
- delete OTP state on success using request-ID-conditional mutation
- return verified response
Important details:
- correct code does not increment failed attempts
- successful verification is one-time-use
- delete failure after correct code returns error
- expired/max-attempt cleanup delete is best-effort
- verification logging is best-effort
- stale verify paths cannot delete or mutate a newer OTP state created by resend
Implemented PostgreSQL request logging using table otp_requests.
Behavior:
- create request log before Redis OTP save and SMS send
- update provider result after SMS success/failure
- request logging is mandatory in SendOTP
- provider response is safe and does not include OTP code
Implemented PostgreSQL verification logging using table otp_verifications.
Logged outcomes:
- success / verified
- failed / not_found
- failed / expired
- failed / invalid_code
- failed / max_attempts_exceeded
Important behavior:
- logging is best-effort
- logging failure does not change VerifyOTP response
- invalid request validation failures are not logged
- infrastructure failures are not logged
- success is logged only after Redis delete succeeds
Implemented endpoints:
POST /v1/otp/send
POST /v1/otp/verify
HTTP layer responsibilities:
- bind JSON
- validate required fields
- call service
- map domain errors to HTTP response
- keep business logic out of handlers
Error mappings include:
invalid request -> 400
tenant disabled -> 403
tenant not found -> 404
OTP already active -> 429
OTP send rate limit exceeded -> 429
SMS provider failed -> 502
generic/internal errors -> 500
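The mapping above can be expressed with `errors.Is` so that wrapped domain errors still resolve to the right status. This is a sketch: `ErrOTPAlreadyActive` and `ErrOTPRateLimited` are named in this document, while the other sentinel names and the `StatusFor` function are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel domain errors. Only ErrOTPAlreadyActive and ErrOTPRateLimited are
// names taken from the document; the rest are placeholders.
var (
	ErrInvalidRequest   = errors.New("invalid request")
	ErrTenantDisabled   = errors.New("tenant disabled")
	ErrTenantNotFound   = errors.New("tenant not found")
	ErrOTPAlreadyActive = errors.New("OTP already active")
	ErrOTPRateLimited   = errors.New("OTP send rate limit exceeded")
	ErrProviderFailed   = errors.New("SMS provider failed")
)

// StatusFor maps a domain error to the HTTP status the handler should return.
func StatusFor(err error) int {
	switch {
	case errors.Is(err, ErrInvalidRequest):
		return http.StatusBadRequest
	case errors.Is(err, ErrTenantDisabled):
		return http.StatusForbidden
	case errors.Is(err, ErrTenantNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrOTPAlreadyActive), errors.Is(err, ErrOTPRateLimited):
		return http.StatusTooManyRequests
	case errors.Is(err, ErrProviderFailed):
		return http.StatusBadGateway
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	wrapped := fmt.Errorf("send otp: %w", ErrTenantDisabled)
	fmt.Println(StatusFor(wrapped), StatusFor(errors.New("boom")))
}
```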
Verify business failures return 200 OK with:
{
"verified": false,
"reason": "..."
}
Implemented behavior:
If an active, unexpired OTP exists and resend cooldown has not elapsed,
a new SendOTP request is rejected.
This happens before the send rate limiter.
Response:
429 Too Many Requests
Retry-After: <remaining_seconds>
Message:
OTP already active
Current semantics:
- OTP_TTL controls how long an OTP can be verified.
- OTP_RESEND_COOLDOWN controls how soon a new OTP can be sent.
- after cooldown elapses, resend is allowed even if the previous OTP has not expired.
- successful resend replaces the previous Redis OTP state.
- the previous OTP code becomes invalid.
- replacement is atomic and protected by request ID.
- cooldown retry timing is represented with OTPResendCooldownError.
- OTPResendCooldownError still unwraps to ErrOTPAlreadyActive for compatibility.
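One way an error type can carry retry timing while still matching the older sentinel is shown below. The `RetryAfter` field, the helper functions, and the round-up-to-at-least-one-second rule for the header value are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrOTPAlreadyActive is the compatibility sentinel named in the document.
var ErrOTPAlreadyActive = errors.New("OTP already active")

// OTPResendCooldownError carries the remaining cooldown.
// The RetryAfter field name is hypothetical.
type OTPResendCooldownError struct {
	RetryAfter time.Duration
}

func (e *OTPResendCooldownError) Error() string {
	return fmt.Sprintf("%s: retry after %s", ErrOTPAlreadyActive.Error(), e.RetryAfter)
}

// Unwrap keeps errors.Is(err, ErrOTPAlreadyActive) working for existing callers.
func (e *OTPResendCooldownError) Unwrap() error { return ErrOTPAlreadyActive }

// IsAlreadyActive reports whether err matches the compatibility sentinel.
func IsAlreadyActive(err error) bool { return errors.Is(err, ErrOTPAlreadyActive) }

// CooldownOf extracts the remaining cooldown when err is a cooldown error.
func CooldownOf(err error) (time.Duration, bool) {
	var ce *OTPResendCooldownError
	if errors.As(err, &ce) {
		return ce.RetryAfter, true
	}
	return 0, false
}

// RetryAfterSeconds rounds up to whole seconds, with a minimum of 1,
// so the Retry-After header never tells a client to retry too early.
func RetryAfterSeconds(d time.Duration) int {
	secs := int((d + time.Second - 1) / time.Second)
	if secs < 1 {
		return 1
	}
	return secs
}

func main() {
	var err error = &OTPResendCooldownError{RetryAfter: 1500 * time.Millisecond}
	if d, ok := CooldownOf(err); ok {
		fmt.Println(IsAlreadyActive(err), RetryAfterSeconds(d))
	}
}
```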
Implemented OTP send rate limiting with multiple Redis-backed strategies.
Supported dimensions:
tenant
phone
tenant + phone
Supported strategies:
fixed_window
token_bucket
Supported production-relevant mixed strategy:
tenant = token_bucket
phone = fixed_window
Current Redis key formats include:
otp:rate:send:fixed_window:tenant:{tenant_id}
otp:rate:send:fixed_window:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:tenant:{tenant_id}
otp:rate:send:token_bucket:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:{tenant:<tenant_id>}:tenant
otp:rate:send:fixed_window:{tenant:<tenant_id>}:phone:<phone>
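The last two key formats embed a `{tenant:<tenant_id>}` hash tag. Under Redis Cluster rules, only the text between the first `{` and the next `}` is hashed for slot assignment (when that text is non-empty), so both keys of a tenant land in the same slot and can be touched by one Lua script. The sketch below builds the keys and extracts the tag; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// MixedTenantKey builds the tenant token-bucket key used by the mixed limiter.
func MixedTenantKey(tenantID string) string {
	return "otp:rate:send:token_bucket:{tenant:" + tenantID + "}:tenant"
}

// MixedPhoneKey builds the phone fixed-window key used by the mixed limiter.
func MixedPhoneKey(tenantID, phone string) string {
	return "otp:rate:send:fixed_window:{tenant:" + tenantID + "}:phone:" + phone
}

// HashTag returns the portion of the key that Redis Cluster hashes:
// the text between the first '{' and the following '}', if non-empty,
// otherwise the whole key.
func HashTag(key string) string {
	open := strings.IndexByte(key, '{')
	if open < 0 {
		return key
	}
	rel := strings.IndexByte(key[open+1:], '}')
	if rel <= 0 { // no closing brace, or an empty "{}"
		return key
	}
	return key[open+1 : open+1+rel]
}

func main() {
	tenantKey := MixedTenantKey("t1")
	phoneKey := MixedPhoneKey("t1", "+15555550100")
	fmt.Println(tenantKey)
	fmt.Println(phoneKey)
	fmt.Println(HashTag(tenantKey) == HashTag(phoneKey))
}
```

This sketch does not escape tenant IDs, so an ID containing `}` would change the effective tag.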
Implementation details:
- Redis-backed adapters implement otp.SendRateLimiter
- fixed-window limiter uses Redis Lua for atomic counter + TTL handling
- token-bucket limiter uses Redis Lua for refill + consume + TTL handling
- mixed limiter uses a single Redis Lua decision for tenant token bucket + phone fixed window
- mixed limiter evaluates both dimensions before consuming either quota
- phone rejection does not consume tenant token
- tenant rejection does not increment phone counter
- both-dimension rejection returns the larger retry duration
- mixed limiter keys use a shared Redis Cluster hash tag per tenant
- missing phone fixed-window TTL is repaired in Lua
- malformed token bucket state returns infrastructure error rather than allowing traffic
- limit exceeded maps to ErrOTPRateLimited
- dimension-aware mixed limiter errors carry phone, tenant, both, or unknown
- rate limiter is optional and env-controlled
- disabled by default
Rate limiting runs after resend cooldown / active OTP protection.
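The "evaluate both dimensions before consuming either quota" rule of the mixed limiter can be sketched as a pure decision function. This models only the decision step: the real limiter performs refill, counting, and TTL repair inside a single Lua script, and the `Snapshot` and `Decision` types here are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot is the limiter state as seen at decision time (after token refill).
type Snapshot struct {
	TenantTokens   float64       // tokens currently available to the tenant
	TenantRefillIn time.Duration // time until the next whole token is available
	PhoneCount     int           // sends counted in the current phone window
	PhoneMax       int           // allowed sends per phone window
	PhoneWindowTTL time.Duration // time until the phone window resets
}

// Decision is the outcome for a single send attempt.
type Decision struct {
	Allowed    bool
	Reason     string // "", "phone", "tenant", or "both"
	RetryAfter time.Duration
}

func longer(a, b time.Duration) time.Duration {
	if a > b {
		return a
	}
	return b
}

// Decide checks both dimensions first and consumes quota only when both allow.
// It returns the decision and the resulting snapshot.
func Decide(s Snapshot) (Decision, Snapshot) {
	tenantOK := s.TenantTokens >= 1
	phoneOK := s.PhoneCount < s.PhoneMax
	switch {
	case tenantOK && phoneOK:
		s.TenantTokens--
		s.PhoneCount++
		return Decision{Allowed: true}, s
	case !tenantOK && !phoneOK:
		return Decision{Reason: "both", RetryAfter: longer(s.TenantRefillIn, s.PhoneWindowTTL)}, s
	case !phoneOK:
		return Decision{Reason: "phone", RetryAfter: s.PhoneWindowTTL}, s // tenant token untouched
	default:
		return Decision{Reason: "tenant", RetryAfter: s.TenantRefillIn}, s // phone counter untouched
	}
}

func main() {
	d, next := Decide(Snapshot{
		TenantTokens:   2,
		TenantRefillIn: 5 * time.Second,
		PhoneCount:     3,
		PhoneMax:       3,
		PhoneWindowTTL: 9 * time.Minute,
	})
	fmt.Println(d.Allowed, d.Reason, d.RetryAfter, next.TenantTokens)
}
```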
Configured values include:
OTP_CODE_LENGTH
OTP_TTL
OTP_RESEND_COOLDOWN
OTP_MAX_ATTEMPTS
OTP_TENANT_CACHE_TTL
OTP_PROVIDER_TIMEOUT
OTP_FAKE_SMS_MIN_DELAY
OTP_FAKE_SMS_MAX_DELAY
OTP_FAKE_SMS_DEBUG_CODE_REDIS
OTP_FAKE_SMS_DEBUG_CODE_TTL
OTP_SEND_RATE_LIMIT_ENABLED
OTP_SEND_RATE_LIMIT_MAX
OTP_SEND_RATE_LIMIT_WINDOW
OTP_SEND_RATE_LIMIT_STRATEGY
OTP_SEND_RATE_LIMIT_PHONE_ENABLED
OTP_SEND_RATE_LIMIT_PHONE_STRATEGY
OTP_SEND_RATE_LIMIT_PHONE_MAX
OTP_SEND_RATE_LIMIT_PHONE_WINDOW
OTP_SEND_RATE_LIMIT_TENANT_ENABLED
OTP_SEND_RATE_LIMIT_TENANT_STRATEGY
OTP_SEND_RATE_LIMIT_TENANT_MAX
OTP_SEND_RATE_LIMIT_TENANT_WINDOW
Current important defaults:
OTP_CODE_LENGTH=6
OTP_TTL=2m
OTP_RESEND_COOLDOWN defaults to OTP_TTL when unset
OTP_MAX_ATTEMPTS=3
OTP_TENANT_CACHE_TTL=5m
OTP_PROVIDER_TIMEOUT=2s
OTP_FAKE_SMS_MIN_DELAY=20ms
OTP_FAKE_SMS_MAX_DELAY=30ms
OTP_FAKE_SMS_DEBUG_CODE_REDIS=false
OTP_FAKE_SMS_DEBUG_CODE_TTL=60s
OTP_SEND_RATE_LIMIT_ENABLED=false
OTP_SEND_RATE_LIMIT_MAX=5
OTP_SEND_RATE_LIMIT_WINDOW=10m
OTP_SEND_RATE_LIMIT_STRATEGY=fixed_window
Validation rules:
OTP_RESEND_COOLDOWN > 0
OTP_RESEND_COOLDOWN <= OTP_TTL
enabled rate-limit dimensions require positive max/window values
unsupported strategies are rejected
unsupported mixed strategies are rejected unless implemented atomically
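The default and validation rules above could be applied at startup along these lines. The `OTPConfig` struct, its field names, and the choice to validate the strategy only when rate limiting is enabled are assumptions for this sketch; only a single rate-limit dimension is modelled.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// OTPConfig is a hypothetical, reduced view of the env-driven configuration.
type OTPConfig struct {
	TTL               time.Duration
	ResendCooldown    time.Duration
	ResendCooldownSet bool // false when OTP_RESEND_COOLDOWN is unset
	RateLimitEnabled  bool
	RateLimitMax      int
	RateLimitWindow   time.Duration
	RateLimitStrategy string
}

// Normalize applies the cooldown default and then checks the validation rules.
func (c OTPConfig) Normalize() (OTPConfig, error) {
	if !c.ResendCooldownSet {
		c.ResendCooldown = c.TTL // cooldown defaults to OTP_TTL when unset
	}
	if c.ResendCooldown <= 0 {
		return c, errors.New("OTP_RESEND_COOLDOWN must be > 0")
	}
	if c.ResendCooldown > c.TTL {
		return c, errors.New("OTP_RESEND_COOLDOWN must be <= OTP_TTL")
	}
	if c.RateLimitEnabled {
		if c.RateLimitMax <= 0 || c.RateLimitWindow <= 0 {
			return c, errors.New("enabled rate limit requires positive max and window")
		}
		switch c.RateLimitStrategy {
		case "fixed_window", "token_bucket":
		default:
			return c, fmt.Errorf("unsupported rate limit strategy %q", c.RateLimitStrategy)
		}
	}
	return c, nil
}

func main() {
	cfg, err := OTPConfig{TTL: 2 * time.Minute}.Normalize()
	fmt.Println(cfg.ResendCooldown, err)

	_, err = OTPConfig{
		TTL:               2 * time.Minute,
		ResendCooldown:    5 * time.Minute,
		ResendCooldownSet: true,
	}.Normalize()
	fmt.Println(err)
}
```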
Operational note:
When a runtime/env parameter is added, update env.example and .env together when the project runs from .env.
Redis is used for:
- OTP state
- OTP attempt counter
- tenant settings cache
- send rate limiter
- dev-only fake SMS OTP debug capture
Main key patterns:
otp:{tenant_id}:{phone}
tenant:{tenant_id}:settings
otp:rate:send:{tenant_id}:{phone}
otp:rate:send:fixed_window:tenant:{tenant_id}
otp:rate:send:fixed_window:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:tenant:{tenant_id}
otp:rate:send:token_bucket:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:{tenant:<tenant_id>}:tenant
otp:rate:send:fixed_window:{tenant:<tenant_id>}:phone:<phone>
debug:otp-code:{tenant_id}:{phone}
PostgreSQL is used for:
- tenant settings
- OTP request logs
- OTP verification logs
Tables involved:
tenant_settings
otp_requests
otp_verifications
Implemented test coverage includes:
- OTP generation tests
- hash/verify tests
- service SendOTP tests
- service VerifyOTP tests
- handler tests
- Redis OTP store tests
- Redis send rate limiter tests
- tenant cache provider tests
- request log repository tests
- verification log repository tests
- fake SMS provider tests
- config/env tests
Common verification commands:
go test -count=1 ./internal/otp -v
go test -count=1 ./internal/api -v
go test -count=1 ./internal/repository -v
go test -count=1 ./internal/config -v
go test -count=1 ./internal/sms -v
go test -count=1 ./...
Manual validation has been performed for:
- /v1/otp/send
- /v1/otp/verify
- active OTP resend protection
- resend cooldown block
- resend after cooldown before OTP expiry
- invalidation of old OTP after resend
- verify with new OTP after resend
- one-time-use OTP behavior
- Redis OTP state deletion after successful verification
- Redis resend_available_at_ms state
- concurrent resend behavior
- resend cooldown Retry-After
- dev-only debug code capture
- verification logging
- request logging
- fixed-window send rate limiting
- send rate-limit Retry-After
- mixed limiter startup
- phone fixed-window block without tenant token consumption
- tenant token-bucket block without phone counter consumption
- mixed limiter Redis key layout and TTLs
- concurrent tenant token-bucket burst behavior
- phone fixed-window enforcement inside mixed limiter
- rate_limited_phone metric reason
- rate_limited_tenant metric reason
- Keep HTTP handlers thin.
- Keep OTP orchestration inside internal/otp.Service.
- Use interfaces for external dependencies.
- Keep Redis/PostgreSQL adapters in internal/repository.
- Keep fake SMS simulation in internal/sms.
- Do not expose OTP code in API responses.
- Do not store plaintext OTP in normal Redis OTP state.
- Use best-effort verification logging.
- Keep request logging mandatory for send lifecycle.
- Keep rate limiting optional and disabled by default.
- Keep changes incremental and small.
- Commit after each stable phase.
- No real SMS provider yet.
- No provider router/registry yet.
- OTP send outcome metrics exist, but full verify/provider/tracing observability is not complete yet.
- No circuit breaker yet.
- No retry policy yet.
- Send rate limiting supports tenant and phone dimensions, but no per-IP limiting exists yet.
- No global per-phone quota yet.
- Retry-After exists for resend cooldown and send rate limiting, but structured client-facing retry metadata is not yet part of the JSON schema.
- OTP phone input is normalized, but Redis key phone components are not hashed.
- No OpenAPI documentation.
- No auth/token validation for OTP endpoints yet.
- Atomic first-send reservation is implemented, but limiter quota and OTP reservation are still separate atomic decisions.
- OTP send idempotency is not implemented yet.
- SMS provider unknown-delivery semantics are not finalized yet.
- Transactional outbox / async SMS delivery is not implemented yet.
After the latest slices, the state of OTP rate limiting and resend behavior is as follows:
OTP_TTL controls verification validity.
OTP_RESEND_COOLDOWN controls resend eligibility.
Redis/Lua protects first-send creation.
Redis/Lua protects resend replacement.
Verify mutations are request-ID conditional.
Send rate limiting supports fixed_window, token_bucket, and one mixed strategy.
Metrics distinguish cooldown, phone rate limit, tenant rate limit, and both-dimension rate limit.
Current SendOTP path, in brief:
validate request
-> normalize phone
-> load tenant settings
-> validate tenant
-> load existing OTP state
-> block if resend cooldown is active
-> run send limiter
-> generate request_id and code
-> create request log
-> CreateIfAbsent or ReplaceIfRequestID
-> send SMS
-> update provider result
-> observe terminal outcome
Current VerifyOTP path, in brief:
validate request
-> normalize phone
-> load OTP state
-> check expiration/max attempts
-> verify code
-> IncrementAttemptsIfRequestID for wrong code
-> DeleteIfRequestID for successful verification
-> log verification outcome
There are two kinds of Retry-After:
resend cooldown -> OTP already active
send rate limit -> OTP send rate limit exceeded
Both are 429 responses, but they carry different domain errors and metric reasons.
Main metric:
otp_send_outcomes_total{result,reason}
Important reasons:
sms_sent
resend_cooldown
rate_limited
rate_limited_phone
rate_limited_tenant
rate_limited_both
reservation_collision
limiter_error
state_create_error
sms_provider_error
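The relationship between the limiter's rejection dimension and the metric reason label can be sketched as a small mapping. The function name is hypothetical, and mapping `unknown` (or any unrecognized dimension) to the generic `rate_limited` reason is an assumption rather than documented behavior.

```go
package main

import "fmt"

// RateLimitReason maps the limiter's rejection dimension to a metric reason.
// Keeping the result to a fixed set of strings keeps label cardinality low.
func RateLimitReason(dimension string) string {
	switch dimension {
	case "phone":
		return "rate_limited_phone"
	case "tenant":
		return "rate_limited_tenant"
	case "both":
		return "rate_limited_both"
	default:
		// Assumption: unknown dimensions fall back to the generic reason.
		return "rate_limited"
	}
}

func main() {
	for _, dim := range []string{"phone", "tenant", "both", "unknown"} {
		fmt.Println(dim, "->", RateLimitReason(dim))
	}
}
```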
Suggested next step:
OTP send idempotency
Reason:
- client-side request retries are not yet deterministic.
- if the client times out and calls send again, there is no idempotency key yet.
- idempotency is a necessary foundation for designing SMS provider retry and the transactional outbox.
Current availability roadmap status:
Phase 1: Single Traefik Gateway Baseline
Status: Done
Topology:
Client -> Traefik -> Backend Service
Phase 2: Backend HA behind Single Traefik
Status: Done
Topology:
Client -> Single Traefik -> Backend-1 / Backend-2
Phase 3: Traefik HA + Keepalived + VIP
Status: Done
Topology:
Client -> VIP -> Traefik-1 / Traefik-2 -> Backend-1 / Backend-2
Phase 4: Nginx version
Status: Deferred / Future
Phase 5: Redis/Postgres HA
Status: Deferred / Future
Current validated HA topology:
Client
-> VIP (192.168.56.100)
-> Traefik-1 / Traefik-2
-> backend-1 / backend-2
-> shared PostgreSQL / Redis / Mongo
The current gateway HA lab uses:
- VirtualBox
- Ubuntu Server 24.04 VMs
- Keepalived
- VRRP
- Virtual IP failover
- Traefik v3.7.1
Current VM roles:
proxy-1
IP: 192.168.56.11
Role: MASTER
Priority: 110
proxy-2
IP: 192.168.56.12
Role: BACKUP
Priority: 100
VIP
192.168.56.100
The backend availability lab currently supports:
Client
-> Single Traefik
-> backend-1 / backend-2
Current backend ports:
backend-1 direct:
localhost:8080
backend-2 direct:
localhost:8083
Traefik gateway:
localhost:8081
Traefik dashboard:
localhost:8082/dashboard/
Important implementation details:
- backend-1 and backend-2 share the same PostgreSQL/Redis/Mongo
- backend HA reuses the main dependency stack
- the original single API container is intentionally excluded during HA lab execution
- Traefik uses file-provider based upstream configuration
- health checks currently use /health
The following runtime behaviors have been manually validated:
Validated:
- backend-1 failure
- backend-2 failure
- backend recovery
- Traefik upstream recovery
- backend load balancing
Validated:
- Traefik failover
- VIP migration
- automatic failback
- reboot recovery
- simultaneous reboot recovery
Validated:
- MASTER/BACKUP election
- VRRP advertisements
- VIP ownership transfer
- service-based failover using Traefik health script
Current important runtime behavior:
Traefik service health controls VIP ownership.
If Traefik fails on the MASTER node:
- Keepalived removes VIP ownership
- BACKUP node becomes MASTER
- VIP migrates automatically
- traffic continues through the surviving node
When the higher-priority node recovers:
VIP automatically fails back to the preferred MASTER node.
Detailed documentation currently exists for:
deploy/availability-lab/traefik-baseline/
deploy/availability-lab/traefik-backend-ha/
deploy/availability-lab/traefik-vm-keepalived/
docs/phase3-traefik-keepalived-ha-lab.md
The full step-by-step VirtualBox + Keepalived + Traefik HA setup, validation workflow, failover tests, reboot validation, and troubleshooting are documented in:
docs/phase3-traefik-keepalived-ha-lab.md
The manually validated Phase 3 VM configuration templates are version-controlled in:
deploy/availability-lab/traefik-vm-keepalived/
The following concerns are intentionally deferred:
- TLS / HTTPS
- Kubernetes ingress/controller deployment
- cloud load balancer integration
- production firewall hardening
- production dashboard security
- Redis HA
- PostgreSQL HA
- Mongo HA
- distributed tracing for failover timing
- metrics for HA transitions
- Nginx HA implementation
- multi-datacenter routing
- BGP/advanced routing