Current Status: OTP Service Project

Purpose

This project currently contains a working OTP flow for a Go backend service. The implementation has been built incrementally with small, reviewable slices using ChatGPT for architecture/review and Codex for focused implementation.

The OTP subsystem currently supports:

  • OTP send
  • OTP verify
  • Redis-backed OTP state
  • tenant settings lookup with Redis cache + PostgreSQL fallback
  • fake SMS provider
  • dev-only fake SMS OTP code capture
  • PostgreSQL request logging
  • PostgreSQL verification logging
  • resend protection while an OTP is active
  • separated OTP validity TTL and resend cooldown
  • Redis/Lua atomic OTP reservation for first send
  • Redis/Lua atomic OTP replacement for eligible resend
  • request-ID-conditional verify mutations
  • resend cooldown Retry-After
  • per tenant + phone OTP send rate limiting
  • Redis fixed-window send limiter
  • Redis token-bucket send limiter
  • Redis mixed send limiter for tenant=token_bucket and phone=fixed_window
  • low-cardinality OTP send outcome metrics
  • refined rate-limit metric reasons for phone, tenant, and both dimensions
  • HTTP handlers and routes
  • runtime wiring in cmd/server/main.go
  • env-driven OTP configuration
  • focused unit and integration-style tests

Current Architecture Snapshot

Current high-level flow:

HTTP API
  -> internal/api handlers
  -> internal/otp.Service
  -> ports/interfaces
  -> Redis/PostgreSQL/SMS adapters

Main components:

internal/api
  Thin HTTP handlers and route registration.

internal/otp
  Domain/application service for SendOTP and VerifyOTP.
  Owns core OTP orchestration, validation, hashing, retry/attempt rules, and domain errors.

internal/repository
  PostgreSQL and Redis adapters:
  - tenant settings repository
  - cached tenant settings provider
  - Redis OTP store
  - Redis send rate limiter
  - OTP request log repository
  - OTP verification log repository

internal/sms
  Fake SMS provider used for local/dev and simulated provider behavior.

internal/config
  Env-driven runtime configuration for OTP, fake SMS, and send rate limiting.

cmd/server/main.go
  Runtime wiring for database, Redis, repositories, OTP service, SMS provider, and routes.

Implemented Features

OTP Domain Foundation

Implemented:

  • OTP domain request/response models
  • tenant settings model
  • OTP state model
  • SMS request/result models
  • request/provider/verification log models
  • domain errors
  • interfaces/ports
  • OTP config defaults
  • OTP hashing helpers
  • configurable numeric OTP generation

OTP Generation and Hashing

Implemented:

  • dynamic numeric OTP generation
  • backward-compatible 6-digit generator
  • SHA-256 based hash helper
  • constant-time code verification helper
  • no plaintext OTP persistence in the main OTP state

Important behavior:

Redis OTP state stores code_hash only.
Plaintext OTP is not stored in the main OTP key.
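
The generation and hashing helpers listed above could look roughly like the following. This is a minimal sketch, not the project's code: the function names GenerateNumericCode, HashCode, and VerifyCode are hypothetical, and it assumes an unsalted SHA-256 digest encoded as hex.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"math/big"
)

// GenerateNumericCode returns a random decimal code of the given length,
// drawing each digit from crypto/rand. (Hypothetical helper name.)
func GenerateNumericCode(length int) (string, error) {
	if length <= 0 {
		return "", errors.New("code length must be positive")
	}
	ten := big.NewInt(10)
	digits := make([]byte, length)
	for i := range digits {
		n, err := rand.Int(rand.Reader, ten)
		if err != nil {
			return "", err
		}
		digits[i] = byte('0' + n.Int64())
	}
	return string(digits), nil
}

// HashCode returns the hex-encoded SHA-256 digest of the code.
// Assumption: no salt or tenant/phone binding is mixed in.
func HashCode(code string) string {
	sum := sha256.Sum256([]byte(code))
	return hex.EncodeToString(sum[:])
}

// VerifyCode hashes the candidate and compares it to the stored hash
// in constant time.
func VerifyCode(candidate, storedHash string) bool {
	got := HashCode(candidate)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	code, err := GenerateNumericCode(6)
	if err != nil {
		panic(err)
	}
	hash := HashCode(code) // only this value would be persisted
	fmt.Println(len(code), VerifyCode(code, hash)) // prints "6 true"
}
```

Only the hash would reach the OTP state; the plaintext code exists in memory just long enough to be handed to the SMS provider.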

Redis OTP Store

Implemented:

  • Redis-backed OTP state store
  • key format: otp:{tenant_id}:{phone}
  • Redis Hash storage
  • Save/Get/Delete
  • atomic CreateIfAbsent using Redis Lua
  • atomic ReplaceIfRequestID using Redis Lua
  • request-ID-conditional DeleteIfRequestID using Redis Lua
  • request-ID-conditional IncrementAttemptsIfRequestID using Redis Lua
  • atomic IncrementAttempts using Redis Lua for compatibility paths
  • TTL-based expiration
  • resend cooldown metadata based on Redis server time
  • malformed state detection
  • integration-style Redis tests

Stored fields:

request_id
tenant_id
phone
code_hash
attempt_count
max_attempts
created_at
expires_at
resend_available_at_ms

Important behavior:

First send creates OTP state only if no active state exists.
Eligible resend replaces OTP state only if the observed request_id is still current.
Verify mutations only affect the state if request_id still matches.
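
The conditional semantics above can be modeled in memory to make the stale-request case concrete. This is a sketch under stated assumptions: MemStore is a hypothetical stand-in, a mutex plays the role that the Lua scripts play in Redis, and TTL handling is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a trimmed stand-in for the stored OTP hash fields.
type State struct {
	RequestID string
	CodeHash  string
	Attempts  int
}

// MemStore models the conditional operations in memory. The mutex stands in
// for the atomicity that a Lua script gives a Redis-backed store.
type MemStore struct {
	mu   sync.Mutex
	data map[string]State
}

func NewMemStore() *MemStore { return &MemStore{data: make(map[string]State)} }

// CreateIfAbsent stores s only when the key has no state yet.
func (m *MemStore) CreateIfAbsent(key string, s State) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, exists := m.data[key]; exists {
		return false
	}
	m.data[key] = s
	return true
}

// ReplaceIfRequestID swaps in next only if the stored request ID still
// equals the one the caller observed earlier.
func (m *MemStore) ReplaceIfRequestID(key, observedID string, next State) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur, exists := m.data[key]
	if !exists || cur.RequestID != observedID {
		return false
	}
	m.data[key] = next
	return true
}

// DeleteIfRequestID removes the state only if it still belongs to requestID.
func (m *MemStore) DeleteIfRequestID(key, requestID string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur, exists := m.data[key]
	if !exists || cur.RequestID != requestID {
		return false
	}
	delete(m.data, key)
	return true
}

func main() {
	s := NewMemStore()
	key := "otp:t1:+15550100"
	fmt.Println(s.CreateIfAbsent(key, State{RequestID: "req-1"}))              // true
	fmt.Println(s.CreateIfAbsent(key, State{RequestID: "req-2"}))              // false
	fmt.Println(s.ReplaceIfRequestID(key, "req-1", State{RequestID: "req-3"})) // true
	fmt.Println(s.DeleteIfRequestID(key, "req-1"))                             // false: stale ID
	fmt.Println(s.DeleteIfRequestID(key, "req-3"))                             // true
}
```

The fourth call shows the property that matters for verify: a caller holding the pre-resend request ID cannot remove the state created by the resend.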

Tenant Settings Cache Provider

Implemented:

  • Redis cache-aside provider for tenant settings
  • PostgreSQL fallback
  • Redis key format: tenant:{tenant_id}:settings
  • stores only OTP-domain tenant settings subset
  • avoids caching sensitive/unneeded DB fields
  • falls back on malformed cache
  • source errors returned when PostgreSQL lookup fails
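
The cache-aside behavior listed above can be sketched as follows. Everything here is illustrative: the Provider type, its fields, and the two-field TenantSettings struct are assumptions, a map stands in for Redis, and a function stands in for the PostgreSQL repository.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrTenantNotFound is a hypothetical source-side error.
var ErrTenantNotFound = errors.New("tenant not found")

// TenantSettings is a hypothetical OTP-only subset of tenant settings.
type TenantSettings struct {
	TenantID string `json:"tenant_id"`
	Enabled  bool   `json:"enabled"`
}

// Provider reads from a cache first and falls back to a source lookup.
// Cache stands in for Redis; Source stands in for the PostgreSQL repository.
type Provider struct {
	Cache  map[string]string
	Source func(tenantID string) (TenantSettings, error)
}

func cacheKey(tenantID string) string { return "tenant:" + tenantID + ":settings" }

// Get returns cached settings when they decode cleanly, otherwise asks the
// source and repopulates the cache.
func (p *Provider) Get(tenantID string) (TenantSettings, error) {
	key := cacheKey(tenantID)
	if raw, ok := p.Cache[key]; ok {
		var cached TenantSettings
		if err := json.Unmarshal([]byte(raw), &cached); err == nil && cached.TenantID == tenantID {
			return cached, nil
		}
		// Malformed or mismatched entry: ignore it and fall through.
	}
	settings, err := p.Source(tenantID)
	if err != nil {
		return TenantSettings{}, err // source errors go back to the caller
	}
	if raw, err := json.Marshal(settings); err == nil {
		p.Cache[key] = string(raw)
	}
	return settings, nil
}

func main() {
	calls := 0
	p := &Provider{
		Cache: map[string]string{},
		Source: func(id string) (TenantSettings, error) {
			calls++
			if id == "missing" {
				return TenantSettings{}, ErrTenantNotFound
			}
			return TenantSettings{TenantID: id, Enabled: true}, nil
		},
	}
	for i := 0; i < 2; i++ {
		if _, err := p.Get("t1"); err != nil {
			panic(err)
		}
	}
	fmt.Println(calls) // 1: the second lookup is served from the cache
}
```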

Fake SMS Provider

Implemented:

  • fake SMS provider implementing OTP SMS provider interface
  • configurable latency
  • default delay: 20ms to 30ms
  • context cancellation/timeout support
  • safe SMS result
  • no OTP code in RawResponse
  • dev-only Redis debug code capture

Dev-Only Fake SMS OTP Capture

Implemented for local/manual testing only.

Behavior:

  • disabled by default
  • enabled through config/env
  • only active outside release mode
  • stores plaintext OTP in a separate Redis debug key
  • does not expose code in API response
  • does not write code into otp_requests
  • does not write code into normal OTP state
  • does not log the code

Debug key format:

debug:otp-code:{tenant_id}:{phone}

SendOTP Service

Implemented behavior:

  1. validate request
  2. normalize phone
  3. load tenant settings
  4. validate tenant
  5. load existing OTP state
  6. block resend if active state exists and resend cooldown has not elapsed
  7. check optional send rate limiter
  8. generate request ID
  9. generate OTP code
  10. hash OTP code
  11. create OTP request log
  12. create OTP state atomically with CreateIfAbsent for first send
  13. replace OTP state atomically with ReplaceIfRequestID for eligible resend
  14. send SMS through provider only after successful Redis reservation/replacement
  15. update provider result log
  16. return request ID and expiration

Important details:

  • phone normalization happens before Redis keys, logs, limiter identity, and SMS send
  • tenant validation happens before Redis OTP state check
  • resend cooldown check happens before send rate limiting
  • eligible resend still passes through send rate limiting
  • blocked cooldown resend does not create request log
  • rate-limited send does not create request log
  • SMS is sent only after atomic OTP state reservation or replacement succeeds
  • SMS provider failure is mapped to domain provider failure
  • request logging is mandatory for send lifecycle
  • request ID changes on every successful resend
  • successful resend invalidates the previous OTP code
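
The ordering guarantees above (cooldown before limiter, no request log for blocked sends, SMS only after reservation) can be illustrated with a stripped-down pipeline. This is a sketch only: the Steps hooks are hypothetical, validation, tenant lookup, and the provider-result log update are omitted, and mapping a failed reservation to the reservation_collision reason is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errProvider is a stand-in provider failure used by the demo.
var errProvider = errors.New("sms provider failed")

// Steps bundles hypothetical hooks for the side-effecting stages of a send.
type Steps struct {
	CooldownActive func() bool  // active OTP whose resend cooldown has not elapsed
	LimiterAllows  func() bool  // optional send rate limiter
	CreateLog      func()       // request log row
	Reserve        func() bool  // CreateIfAbsent or ReplaceIfRequestID
	SendSMS        func() error // provider call
}

// Send runs the stages in the documented order and returns an outcome reason.
func Send(s Steps) string {
	if s.CooldownActive() {
		return "resend_cooldown" // limiter is not consulted, no request log
	}
	if !s.LimiterAllows() {
		return "rate_limited" // no request log
	}
	s.CreateLog()
	if !s.Reserve() {
		return "reservation_collision" // no SMS without a reservation
	}
	if err := s.SendSMS(); err != nil {
		return "sms_provider_error"
	}
	return "sms_sent"
}

func main() {
	logs := 0
	steps := Steps{
		CooldownActive: func() bool { return true },
		LimiterAllows:  func() bool { return true },
		CreateLog:      func() { logs++ },
		Reserve:        func() bool { return true },
		SendSMS:        func() error { return nil },
	}
	fmt.Println(Send(steps), logs) // resend_cooldown 0

	steps.CooldownActive = func() bool { return false }
	steps.SendSMS = func() error { return errProvider }
	fmt.Println(Send(steps), logs) // sms_provider_error 1
}
```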

VerifyOTP Service

Implemented behavior:

  1. validate request
  2. normalize phone
  3. load OTP state from Redis
  4. handle not found
  5. handle expiration
  6. handle max attempts already reached
  7. verify code hash
  8. increment attempts only for wrong code using request-ID-conditional mutation
  9. handle invalid code
  10. handle max attempts after increment
  11. delete OTP state on success using request-ID-conditional mutation
  12. return verified response

Important details:

  • correct code does not increment failed attempts
  • successful verification is one-time-use
  • delete failure after correct code returns error
  • expired/max-attempt cleanup delete is best-effort
  • verification logging is best-effort
  • stale verify paths cannot delete or mutate a newer OTP state created by resend
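
The verify steps reduce to a small decision table, sketched below as a pure function. The names are hypothetical, times are plain Unix milliseconds, and one point is an assumption rather than documented behavior: a wrong code that uses up the last attempt is reported as max_attempts_exceeded instead of invalid_code.

```go
package main

import "fmt"

// State is a trimmed stand-in for the stored OTP state.
type State struct {
	Attempts    int
	MaxAttempts int
	ExpiresAtMs int64 // Unix milliseconds
}

// Mutation names the store operation the caller should run next.
const (
	MutateNone      = "none"
	MutateIncrement = "IncrementAttemptsIfRequestID"
	MutateDelete    = "DeleteIfRequestID"
)

// Decide returns the outcome reason and the follow-up mutation.
// state is nil when no OTP state was found; codeMatches is the result of
// the constant-time hash comparison.
func Decide(state *State, codeMatches bool, nowMs int64) (reason, mutation string) {
	switch {
	case state == nil:
		return "not_found", MutateNone
	case nowMs >= state.ExpiresAtMs:
		return "expired", MutateDelete // best-effort cleanup
	case state.Attempts >= state.MaxAttempts:
		return "max_attempts_exceeded", MutateDelete // best-effort cleanup
	case codeMatches:
		return "verified", MutateDelete // delete must succeed before success is reported
	case state.Attempts+1 >= state.MaxAttempts:
		return "max_attempts_exceeded", MutateIncrement // assumption: last wrong attempt
	default:
		return "invalid_code", MutateIncrement
	}
}

func main() {
	st := &State{Attempts: 0, MaxAttempts: 3, ExpiresAtMs: 2000}
	fmt.Println(Decide(st, false, 1000)) // invalid_code IncrementAttemptsIfRequestID
	fmt.Println(Decide(st, true, 1000))  // verified DeleteIfRequestID
	fmt.Println(Decide(nil, true, 1000)) // not_found none
}
```

Keeping the decision separate from the store calls makes the ordering easy to test: a correct code never reaches the increment branch.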

Request Logging

Implemented PostgreSQL request logging using table otp_requests.

Behavior:

  • create request log before Redis OTP save and SMS send
  • update provider result after SMS success/failure
  • request logging is mandatory in SendOTP
  • provider response is safe and does not include OTP code

Verification Logging

Implemented PostgreSQL verification logging using table otp_verifications.

Logged outcomes:

  • success / verified
  • failed / not_found
  • failed / expired
  • failed / invalid_code
  • failed / max_attempts_exceeded

Important behavior:

  • logging is best-effort
  • logging failure does not change VerifyOTP response
  • invalid request validation failures are not logged
  • infrastructure failures are not logged
  • success is logged only after Redis delete succeeds

HTTP API

Implemented endpoints:

POST /v1/otp/send
POST /v1/otp/verify

HTTP layer responsibilities:

  • bind JSON
  • validate required fields
  • call service
  • map domain errors to HTTP response
  • keep business logic out of handlers

Error mappings include:

invalid request -> 400
tenant disabled -> 403
tenant not found -> 404
OTP already active -> 429
OTP send rate limit exceeded -> 429
SMS provider failed -> 502
generic/internal errors -> 500

Verify business failures return 200 OK with:

{
  "verified": false,
  "reason": "..."
}
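
The error mappings above could be centralized in one function that the handlers call. This is a sketch: ErrOTPAlreadyActive and ErrOTPRateLimited are named elsewhere in this document, while the other sentinel names and StatusFor itself are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel domain errors. Only ErrOTPAlreadyActive and ErrOTPRateLimited
// are named in this document; the rest are hypothetical names.
var (
	ErrInvalidRequest    = errors.New("invalid request")
	ErrTenantDisabled    = errors.New("tenant disabled")
	ErrTenantNotFound    = errors.New("tenant not found")
	ErrOTPAlreadyActive  = errors.New("OTP already active")
	ErrOTPRateLimited    = errors.New("OTP send rate limit exceeded")
	ErrSMSProviderFailed = errors.New("SMS provider failed")
)

// StatusFor maps a domain error to an HTTP status code. errors.Is is used
// so that wrapped errors map the same way as bare sentinels.
func StatusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrInvalidRequest):
		return http.StatusBadRequest
	case errors.Is(err, ErrTenantDisabled):
		return http.StatusForbidden
	case errors.Is(err, ErrTenantNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrOTPAlreadyActive), errors.Is(err, ErrOTPRateLimited):
		return http.StatusTooManyRequests
	case errors.Is(err, ErrSMSProviderFailed):
		return http.StatusBadGateway
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	wrapped := fmt.Errorf("send otp: %w", ErrOTPRateLimited)
	fmt.Println(StatusFor(wrapped))                   // 429
	fmt.Println(StatusFor(errors.New("db is down"))) // 500
}
```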

Resend Protection and Cooldown

Implemented behavior:

If an active, unexpired OTP exists and resend cooldown has not elapsed,
a new SendOTP request is rejected.

This happens before the send rate limiter.

Response:

429 Too Many Requests
Retry-After: <remaining_seconds>

Message:

OTP already active

Current semantics:

  • OTP_TTL controls how long an OTP can be verified.
  • OTP_RESEND_COOLDOWN controls how soon a new OTP can be sent.
  • after cooldown elapses, resend is allowed even if the previous OTP has not expired.
  • successful resend replaces the previous Redis OTP state.
  • the previous OTP code becomes invalid.
  • replacement is atomic and protected by request ID.
  • cooldown retry timing is represented with OTPResendCooldownError.
  • OTPResendCooldownError still unwraps to ErrOTPAlreadyActive for compatibility.
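
The compatibility note above relies on Go's error unwrapping. A minimal sketch of how such a type could be shaped follows; the RetryAfter field, the IsAlreadyActive helper, and the round-up-with-minimum-1 rule in RetryAfterSeconds are assumptions, not the project's confirmed implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrOTPAlreadyActive = errors.New("OTP already active")

// OTPResendCooldownError carries the remaining cooldown.
// The RetryAfter field name is an assumption.
type OTPResendCooldownError struct {
	RetryAfter time.Duration
}

func (e *OTPResendCooldownError) Error() string {
	return fmt.Sprintf("%v: retry after %s", ErrOTPAlreadyActive, e.RetryAfter)
}

// Unwrap keeps errors.Is(err, ErrOTPAlreadyActive) working for older callers.
func (e *OTPResendCooldownError) Unwrap() error { return ErrOTPAlreadyActive }

// IsAlreadyActive is a hypothetical convenience wrapper around errors.Is.
func IsAlreadyActive(err error) bool { return errors.Is(err, ErrOTPAlreadyActive) }

// RetryAfterSeconds rounds the remaining duration up to whole seconds for
// the Retry-After header, never returning less than 1.
func RetryAfterSeconds(d time.Duration) int {
	if d <= 0 {
		return 1
	}
	return int((d + time.Second - 1) / time.Second)
}

func main() {
	var err error = &OTPResendCooldownError{RetryAfter: 1500 * time.Millisecond}

	var cooldown *OTPResendCooldownError
	if errors.As(err, &cooldown) {
		// prints "true 2"
		fmt.Println(IsAlreadyActive(err), RetryAfterSeconds(cooldown.RetryAfter))
	}
}
```

Rounding up avoids telling a client to retry a fraction of a second before the cooldown actually ends.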

Send Rate Limiting

Implemented OTP send rate limiting with multiple Redis-backed strategies.

Supported dimensions:

tenant
phone
tenant + phone

Supported strategies:

fixed_window
token_bucket

Supported production-relevant mixed strategy:

tenant = token_bucket
phone  = fixed_window

Current Redis key formats include:

otp:rate:send:fixed_window:tenant:{tenant_id}
otp:rate:send:fixed_window:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:tenant:{tenant_id}
otp:rate:send:token_bucket:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:{tenant:<tenant_id>}:tenant
otp:rate:send:fixed_window:{tenant:<tenant_id>}:phone:<phone>

Implementation details:

  • Redis-backed adapters implement otp.SendRateLimiter
  • fixed-window limiter uses Redis Lua for atomic counter + TTL handling
  • token-bucket limiter uses Redis Lua for refill + consume + TTL handling
  • mixed limiter uses a single Redis Lua decision for tenant token bucket + phone fixed window
  • mixed limiter evaluates both dimensions before consuming either quota
  • phone rejection does not consume tenant token
  • tenant rejection does not increment phone counter
  • both-dimension rejection returns the larger retry duration
  • mixed limiter keys use a shared Redis Cluster hash tag per tenant
  • missing phone fixed-window TTL is repaired in Lua
  • malformed token bucket state returns infrastructure error rather than allowing traffic
  • limit exceeded maps to ErrOTPRateLimited
  • dimension-aware mixed limiter errors carry phone, tenant, both, or unknown
  • rate limiter is optional and env-controlled
  • disabled by default

Rate limiting runs after resend cooldown / active OTP protection.
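
The mixed limiter's "evaluate both, then consume" rule can be shown with an in-memory decision function. This is a simplified sketch: the Snapshot and Decision types are hypothetical, token refill is assumed to have been applied already, and the real decision is described above as running inside a single Redis Lua script.

```go
package main

import "fmt"

// Snapshot is the limiter state at decision time (durations in milliseconds).
type Snapshot struct {
	TenantTokens     float64 // tokens available in the tenant bucket, after refill
	TenantRefillMs   int64   // time until the next tenant token is available
	PhoneCount       int     // sends counted in the current phone window
	PhoneMax         int     // phone fixed-window limit
	PhoneWindowRemMs int64   // time left in the current phone window
}

// Decision reports the result and, on rejection, which dimension blocked.
type Decision struct {
	Allowed      bool
	Dimension    string // "", "phone", "tenant", or "both"
	RetryAfterMs int64
}

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// Decide evaluates both dimensions first and consumes quota only when
// both of them allow the send.
func Decide(s *Snapshot) Decision {
	tenantBlocked := s.TenantTokens < 1
	phoneBlocked := s.PhoneCount >= s.PhoneMax
	switch {
	case tenantBlocked && phoneBlocked:
		return Decision{Dimension: "both", RetryAfterMs: maxInt64(s.TenantRefillMs, s.PhoneWindowRemMs)}
	case phoneBlocked:
		return Decision{Dimension: "phone", RetryAfterMs: s.PhoneWindowRemMs}
	case tenantBlocked:
		return Decision{Dimension: "tenant", RetryAfterMs: s.TenantRefillMs}
	}
	s.TenantTokens-- // consume one tenant token
	s.PhoneCount++   // count one send in the phone window
	return Decision{Allowed: true}
}

func main() {
	s := &Snapshot{TenantTokens: 2, TenantRefillMs: 500, PhoneCount: 3, PhoneMax: 3, PhoneWindowRemMs: 9000}
	d := Decide(s)
	fmt.Println(d.Dimension, d.RetryAfterMs, s.TenantTokens) // phone 9000 2
}
```

In the demo the phone window is full, so the tenant bucket still holds both tokens after the rejection.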

Current Configuration / Env Support

Configured values include:

OTP_CODE_LENGTH
OTP_TTL
OTP_RESEND_COOLDOWN
OTP_MAX_ATTEMPTS
OTP_TENANT_CACHE_TTL
OTP_PROVIDER_TIMEOUT

OTP_FAKE_SMS_MIN_DELAY
OTP_FAKE_SMS_MAX_DELAY
OTP_FAKE_SMS_DEBUG_CODE_REDIS
OTP_FAKE_SMS_DEBUG_CODE_TTL

OTP_SEND_RATE_LIMIT_ENABLED
OTP_SEND_RATE_LIMIT_MAX
OTP_SEND_RATE_LIMIT_WINDOW
OTP_SEND_RATE_LIMIT_STRATEGY

OTP_SEND_RATE_LIMIT_PHONE_ENABLED
OTP_SEND_RATE_LIMIT_PHONE_STRATEGY
OTP_SEND_RATE_LIMIT_PHONE_MAX
OTP_SEND_RATE_LIMIT_PHONE_WINDOW

OTP_SEND_RATE_LIMIT_TENANT_ENABLED
OTP_SEND_RATE_LIMIT_TENANT_STRATEGY
OTP_SEND_RATE_LIMIT_TENANT_MAX
OTP_SEND_RATE_LIMIT_TENANT_WINDOW

Current important defaults:

OTP_CODE_LENGTH=6
OTP_TTL=2m
OTP_RESEND_COOLDOWN defaults to OTP_TTL when unset
OTP_MAX_ATTEMPTS=3
OTP_TENANT_CACHE_TTL=5m
OTP_PROVIDER_TIMEOUT=2s

OTP_FAKE_SMS_MIN_DELAY=20ms
OTP_FAKE_SMS_MAX_DELAY=30ms
OTP_FAKE_SMS_DEBUG_CODE_REDIS=false
OTP_FAKE_SMS_DEBUG_CODE_TTL=60s

OTP_SEND_RATE_LIMIT_ENABLED=false
OTP_SEND_RATE_LIMIT_MAX=5
OTP_SEND_RATE_LIMIT_WINDOW=10m
OTP_SEND_RATE_LIMIT_STRATEGY=fixed_window

Validation rules:

OTP_RESEND_COOLDOWN > 0
OTP_RESEND_COOLDOWN <= OTP_TTL
enabled rate-limit dimensions require positive max/window values
unsupported strategies are rejected
unsupported mixed strategies are rejected unless implemented atomically
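
The two OTP_RESEND_COOLDOWN rules, together with the default-to-TTL behavior, can be sketched as one resolver. The function name and the cooldownSet flag are hypothetical; how the real config loader detects an unset variable is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ResolveResendCooldown applies the default and the two validation rules.
// cooldownSet reports whether OTP_RESEND_COOLDOWN was present in the
// environment (hypothetical parameter).
func ResolveResendCooldown(ttl, cooldown time.Duration, cooldownSet bool) (time.Duration, error) {
	if !cooldownSet {
		cooldown = ttl // unset: default to OTP_TTL
	}
	if cooldown <= 0 {
		return 0, errors.New("OTP_RESEND_COOLDOWN must be greater than zero")
	}
	if cooldown > ttl {
		return 0, errors.New("OTP_RESEND_COOLDOWN must not exceed OTP_TTL")
	}
	return cooldown, nil
}

func main() {
	d, err := ResolveResendCooldown(2*time.Minute, 0, false)
	fmt.Println(d, err) // 2m0s <nil>

	_, err = ResolveResendCooldown(2*time.Minute, 5*time.Minute, true)
	fmt.Println(err != nil) // true
}
```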

Operational note:

When a runtime/env parameter is added, update env.example and .env together when the project runs from .env.

Current Redis Usage

Redis is used for:

  • OTP state
  • OTP attempt counter
  • tenant settings cache
  • send rate limiter
  • dev-only fake SMS OTP debug capture

Main key patterns:

otp:{tenant_id}:{phone}
tenant:{tenant_id}:settings
otp:rate:send:{tenant_id}:{phone}
otp:rate:send:fixed_window:tenant:{tenant_id}
otp:rate:send:fixed_window:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:tenant:{tenant_id}
otp:rate:send:token_bucket:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:{tenant:<tenant_id>}:tenant
otp:rate:send:fixed_window:{tenant:<tenant_id>}:phone:<phone>
debug:otp-code:{tenant_id}:{phone}
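
A few of these patterns are sketched below as key-builder helpers, mainly to show why the two mixed-limiter keys can be touched by one Lua script on Redis Cluster: Redis Cluster hashes only the text between the first `{` and the next `}` when that text is non-empty, so both keys land in the same slot. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

func OTPStateKey(tenantID, phone string) string {
	return fmt.Sprintf("otp:%s:%s", tenantID, phone)
}

func TenantSettingsKey(tenantID string) string {
	return fmt.Sprintf("tenant:%s:settings", tenantID)
}

func DebugCodeKey(tenantID, phone string) string {
	return fmt.Sprintf("debug:otp-code:%s:%s", tenantID, phone)
}

// MixedTenantBucketKey and MixedPhoneWindowKey share the {tenant:<id>} hash tag.
func MixedTenantBucketKey(tenantID string) string {
	return fmt.Sprintf("otp:rate:send:token_bucket:{tenant:%s}:tenant", tenantID)
}

func MixedPhoneWindowKey(tenantID, phone string) string {
	return fmt.Sprintf("otp:rate:send:fixed_window:{tenant:%s}:phone:%s", tenantID, phone)
}

// HashTag returns the part of the key that Redis Cluster hashes: the text
// between the first '{' and the next '}', or the whole key when there is
// no such non-empty section.
func HashTag(key string) string {
	start := strings.IndexByte(key, '{')
	if start < 0 {
		return key
	}
	end := strings.IndexByte(key[start+1:], '}')
	if end <= 0 {
		return key // no closing brace, or empty "{}"
	}
	return key[start+1 : start+1+end]
}

func main() {
	a := MixedTenantBucketKey("t1")
	b := MixedPhoneWindowKey("t1", "+15550100")
	fmt.Println(HashTag(a), HashTag(a) == HashTag(b)) // tenant:t1 true
}
```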

Current PostgreSQL Usage

PostgreSQL is used for:

  • tenant settings
  • OTP request logs
  • OTP verification logs

Tables involved:

tenant_settings
otp_requests
otp_verifications

Current Testing Status

Implemented test coverage includes:

  • OTP generation tests
  • hash/verify tests
  • service SendOTP tests
  • service VerifyOTP tests
  • handler tests
  • Redis OTP store tests
  • Redis send rate limiter tests
  • tenant cache provider tests
  • request log repository tests
  • verification log repository tests
  • fake SMS provider tests
  • config/env tests

Common verification commands:

go test -count=1 ./internal/otp -v
go test -count=1 ./internal/api -v
go test -count=1 ./internal/repository -v
go test -count=1 ./internal/config -v
go test -count=1 ./internal/sms -v
go test -count=1 ./...

Current Manual Runtime Validation

Manual validation has been performed for:

  • /v1/otp/send
  • /v1/otp/verify
  • active OTP resend protection
  • resend cooldown block
  • resend after cooldown before OTP expiry
  • invalidation of old OTP after resend
  • verify with new OTP after resend
  • one-time-use OTP behavior
  • Redis OTP state deletion after successful verification
  • Redis resend_available_at_ms state
  • concurrent resend behavior
  • resend cooldown Retry-After
  • dev-only debug code capture
  • verification logging
  • request logging
  • fixed-window send rate limiting
  • send rate-limit Retry-After
  • mixed limiter startup
  • phone fixed-window block without tenant token consumption
  • tenant token-bucket block without phone counter consumption
  • mixed limiter Redis key layout and TTLs
  • concurrent tenant token-bucket burst behavior
  • phone fixed-window enforcement inside mixed limiter
  • rate_limited_phone metric reason
  • rate_limited_tenant metric reason

Important Implementation Decisions

  • Keep HTTP handlers thin.
  • Keep OTP orchestration inside internal/otp.Service.
  • Use interfaces for external dependencies.
  • Keep Redis/PostgreSQL adapters in internal/repository.
  • Keep fake SMS simulation in internal/sms.
  • Do not expose OTP code in API responses.
  • Do not store plaintext OTP in normal Redis OTP state.
  • Use best-effort verification logging.
  • Keep request logging mandatory for send lifecycle.
  • Keep rate limiting optional and disabled by default.
  • Keep changes incremental and small.
  • Commit after each stable phase.

Current Known Limitations

  • No real SMS provider yet.
  • No provider router/registry yet.
  • OTP send outcome metrics exist, but full verify/provider/tracing observability is not complete yet.
  • No circuit breaker yet.
  • No retry policy yet.
  • Send rate limiting supports tenant and phone dimensions, but no per-IP limiting exists yet.
  • No global per-phone quota yet.
  • Retry-After exists for resend cooldown and send rate limiting, but structured client-facing retry metadata is not yet part of the JSON schema.
  • OTP phone input is normalized, but Redis key phone components are not hashed.
  • No OpenAPI documentation.
  • No auth/token validation for OTP endpoints yet.
  • Atomic first-send reservation is implemented, but limiter quota and OTP reservation are still separate atomic decisions.
  • OTP send idempotency is not implemented yet.
  • SMS provider unknown-delivery semantics are not finalized yet.
  • Transactional outbox / async SMS delivery is not implemented yet.

Status Update: OTP Rate Limiting and Resend Flow

Latest Status Summary

After the latest slices, OTP rate limiting and resend behavior stand as follows:

OTP_TTL controls verification validity.
OTP_RESEND_COOLDOWN controls resend eligibility.
Redis/Lua protects first-send creation.
Redis/Lua protects resend replacement.
Verify mutations are request-ID conditional.
Send rate limiting supports fixed_window, token_bucket, and one mixed strategy.
Metrics distinguish cooldown, phone rate limit, tenant rate limit, and both-dimension rate limit.

Current SendOTP Path

The current SendOTP path, in brief:

validate request
-> normalize phone
-> load tenant settings
-> validate tenant
-> load existing OTP state
-> block if resend cooldown is active
-> run send limiter
-> generate request_id and code
-> create request log
-> CreateIfAbsent or ReplaceIfRequestID
-> send SMS
-> update provider result
-> observe terminal outcome

Current VerifyOTP Path

The current VerifyOTP path, in brief:

validate request
-> normalize phone
-> load OTP state
-> check expiration/max attempts
-> verify code
-> IncrementAttemptsIfRequestID for wrong code
-> DeleteIfRequestID for successful verification
-> log verification outcome

Current Retry-After Behavior

There are two kinds of Retry-After:

resend cooldown -> OTP already active
send rate limit -> OTP send rate limit exceeded

Both return 429, but each has its own domain error and metric reason.

OTP Send Outcome Metrics

Main metric:

otp_send_outcomes_total{result,reason}

Key reasons:

sms_sent
resend_cooldown
rate_limited
rate_limited_phone
rate_limited_tenant
rate_limited_both
reservation_collision
limiter_error
state_create_error
sms_provider_error
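
The dimension-specific reasons above could be derived from the limiter's rejection dimension with a small mapping. This is a sketch: the Outcomes map is an in-memory stand-in for the counter, the "rejected" result label is made up for the demo, and falling back to the generic rate_limited reason for an unknown dimension is an assumption.

```go
package main

import "fmt"

// Outcomes is an in-memory stand-in for otp_send_outcomes_total{result,reason}.
type Outcomes map[[2]string]int

// Observe increments the counter for one (result, reason) pair.
func (o Outcomes) Observe(result, reason string) { o[[2]string{result, reason}]++ }

// RateLimitReason maps a limiter dimension to a low-cardinality reason.
// Unknown dimensions fall back to the generic reason (assumption).
func RateLimitReason(dimension string) string {
	switch dimension {
	case "phone":
		return "rate_limited_phone"
	case "tenant":
		return "rate_limited_tenant"
	case "both":
		return "rate_limited_both"
	default:
		return "rate_limited"
	}
}

func main() {
	o := Outcomes{}
	// "rejected" is a made-up result label for this demo.
	o.Observe("rejected", RateLimitReason("phone"))
	o.Observe("rejected", RateLimitReason("phone"))
	fmt.Println(o[[2]string{"rejected", "rate_limited_phone"}]) // 2
}
```

Because the reason set is fixed, label cardinality stays bounded no matter how many tenants or phone numbers hit the limiter.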

Proposed Next Step

Proposed next phase:

OTP send idempotency

Rationale:

  • Client-side request retries are not yet deterministic.
  • If a client times out and sends again, there is no idempotency key yet.
  • Idempotency is the necessary foundation for designing SMS provider retry and a transactional outbox.

Availability / HA Lab Status

Availability Roadmap Status

Current availability roadmap status:

Phase 1: Single Traefik Gateway Baseline
Status: Done

Topology:
Client -> Traefik -> Backend Service

Phase 2: Backend HA behind Single Traefik
Status: Done

Topology:
Client -> Single Traefik -> Backend-1 / Backend-2

Phase 3: Traefik HA + Keepalived + VIP
Status: Done

Topology:
Client -> VIP -> Traefik-1 / Traefik-2 -> Backend-1 / Backend-2

Phase 4: Nginx version
Status: Deferred / Future

Phase 5: Redis/Postgres HA
Status: Deferred / Future

Current Availability Topology

Current validated HA topology:

Client
  -> VIP (192.168.56.100)
  -> Traefik-1 / Traefik-2
  -> backend-1 / backend-2
  -> shared PostgreSQL / Redis / Mongo

Current Gateway HA Lab

The current gateway HA lab uses:

- VirtualBox
- Ubuntu Server 24.04 VMs
- Keepalived
- VRRP
- Virtual IP failover
- Traefik v3.7.1

Current VM roles:

proxy-1
  IP: 192.168.56.11
  Role: MASTER
  Priority: 110

proxy-2
  IP: 192.168.56.12
  Role: BACKUP
  Priority: 100

VIP
  192.168.56.100

Current Backend HA Lab

The backend availability lab currently supports:

Client
  -> Single Traefik
  -> backend-1 / backend-2

Current backend ports:

backend-1 direct:
  localhost:8080

backend-2 direct:
  localhost:8083

Traefik gateway:
  localhost:8081

Traefik dashboard:
  localhost:8082/dashboard/

Important implementation details:

- backend-1 and backend-2 share the same PostgreSQL/Redis/Mongo
- backend HA reuses the main dependency stack
- the original single API container is intentionally excluded during HA lab execution
- Traefik uses file-provider based upstream configuration
- health checks currently use /health

Current Validated HA Behaviors

The following runtime behaviors have been manually validated:

Backend Availability

Validated:

- backend-1 failure
- backend-2 failure
- backend recovery
- Traefik upstream recovery
- backend load balancing

Gateway Availability

Validated:

- Traefik failover
- VIP migration
- automatic failback
- reboot recovery
- simultaneous reboot recovery

Keepalived / VRRP

Validated:

- MASTER/BACKUP election
- VRRP advertisements
- VIP ownership transfer
- service-based failover using Traefik health script

Current HA Runtime Notes

Current important runtime behavior:

Traefik service health controls VIP ownership.

If Traefik fails on the MASTER node:

- Keepalived removes VIP ownership
- BACKUP node becomes MASTER
- VIP migrates automatically
- traffic continues through the surviving node

When the higher-priority node recovers:

VIP automatically fails back to the preferred MASTER node.

Current HA-Related Documentation

Detailed documentation currently exists for:

deploy/availability-lab/traefik-baseline/
deploy/availability-lab/traefik-backend-ha/
deploy/availability-lab/traefik-vm-keepalived/
docs/phase3-traefik-keepalived-ha-lab.md

The full step-by-step VirtualBox + Keepalived + Traefik HA setup, validation workflow, failover tests, reboot validation, and troubleshooting are documented in:

docs/phase3-traefik-keepalived-ha-lab.md

The manually validated Phase 3 VM configuration templates are version-controlled in:

deploy/availability-lab/traefik-vm-keepalived/

Current HA Deferred Concerns

The following concerns are intentionally deferred:

- TLS / HTTPS
- Kubernetes ingress/controller deployment
- cloud load balancer integration
- production firewall hardening
- production dashboard security
- Redis HA
- PostgreSQL HA
- Mongo HA
- distributed tracing for failover timing
- metrics for HA transitions
- Nginx HA implementation
- multi-datacenter routing
- BGP/advanced routing