Current State: OTP Service Project

Purpose

This project currently contains a working OTP flow for a Go backend service. The implementation has been built incrementally with small, reviewable slices using ChatGPT for architecture/review and Codex for focused implementation.

The OTP subsystem currently supports:

OTP send
OTP verify
Redis-backed OTP state
tenant settings lookup with Redis cache + PostgreSQL fallback
fake SMS provider
dev-only fake SMS OTP code capture
PostgreSQL request logging
PostgreSQL verification logging
resend protection while an OTP is active
separated OTP validity TTL and resend cooldown
Redis/Lua atomic OTP reservation for first send
Redis/Lua atomic OTP replacement for eligible resend
request-ID-conditional verify mutations
resend cooldown Retry-After
per tenant + phone OTP send rate limiting
Redis fixed-window send limiter
Redis token-bucket send limiter
Redis mixed send limiter for tenant=token_bucket and phone=fixed_window
low-cardinality OTP send outcome metrics
refined rate-limit metric reasons for phone, tenant, and both dimensions
HTTP handlers and routes
runtime wiring in cmd/server/main.go
env-driven OTP configuration
focused unit and integration-style tests

Current Architecture Snapshot

Current high-level flow:

HTTP API
  -> internal/api handlers
  -> internal/otp.Service
  -> ports/interfaces
  -> Redis/PostgreSQL/SMS adapters

Main components:

internal/api
  Thin HTTP handlers and route registration.

internal/otp
  Domain/application service for SendOTP and VerifyOTP.
  Owns core OTP orchestration, validation, hashing, retry/attempt rules, and domain errors.

internal/repository
  PostgreSQL and Redis adapters:
  - tenant settings repository
  - cached tenant settings provider
  - Redis OTP store
  - Redis send rate limiter
  - OTP request log repository
  - OTP verification log repository

internal/sms
  Fake SMS provider used for local/dev and simulated provider behavior.

internal/config
  Env-driven runtime configuration for OTP, fake SMS, and send rate limiting.

cmd/server/main.go
  Runtime wiring for database, Redis, repositories, OTP service, SMS provider, and routes.

Implemented Features

OTP Domain Foundation

Implemented:

OTP domain request/response models
tenant settings model
OTP state model
SMS request/result models
request/provider/verification log models
domain errors
interfaces/ports
OTP config defaults
OTP hashing helpers
configurable numeric OTP generation

OTP Generation and Hashing

Implemented:

dynamic numeric OTP generation
backward-compatible 6-digit generator
SHA-256 based hash helper
constant-time code verification helper
no plaintext OTP persistence in the main OTP state

Important behavior:

Redis OTP state stores code_hash only.
Plaintext OTP is not stored in the main OTP key.
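
The generation and hashing helpers listed above could look roughly like the following. This is a minimal sketch, not the project's code: the function names GenerateNumericCode, HashCode, and VerifyCode are hypothetical, and it assumes the hash covers only the code itself (the real helper may mix in other inputs).

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"math/big"
)

// GenerateNumericCode returns a random decimal code of the given length,
// drawing each digit from crypto/rand. (Hypothetical name.)
func GenerateNumericCode(length int) (string, error) {
	if length <= 0 {
		return "", fmt.Errorf("invalid code length %d", length)
	}
	digits := make([]byte, length)
	ten := big.NewInt(10)
	for i := range digits {
		n, err := rand.Int(rand.Reader, ten)
		if err != nil {
			return "", err
		}
		digits[i] = byte('0' + n.Int64())
	}
	return string(digits), nil
}

// HashCode returns the hex-encoded SHA-256 digest of the code.
// Assumption: only the code is hashed.
func HashCode(code string) string {
	sum := sha256.Sum256([]byte(code))
	return hex.EncodeToString(sum[:])
}

// VerifyCode hashes the submitted code and compares it to the stored hash
// in constant time.
func VerifyCode(code, storedHash string) bool {
	computed := HashCode(code)
	return subtle.ConstantTimeCompare([]byte(computed), []byte(storedHash)) == 1
}

func main() {
	code, err := GenerateNumericCode(6)
	if err != nil {
		panic(err)
	}
	hash := HashCode(code)
	// The code itself is deliberately not printed.
	fmt.Println("code length:", len(code))
	fmt.Println("hash length:", len(hash))
	fmt.Println("verifies:", VerifyCode(code, hash))
}
```

Only the hash would be written to the code_hash field; the plaintext code leaves the service solely through the SMS provider.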

Redis OTP Store

Implemented:

Redis-backed OTP state store
key format: otp:{tenant_id}:{phone}
Redis Hash storage
Save/Get/Delete
atomic CreateIfAbsent using Redis Lua
atomic ReplaceIfRequestID using Redis Lua
request-ID-conditional DeleteIfRequestID using Redis Lua
request-ID-conditional IncrementAttemptsIfRequestID using Redis Lua
atomic IncrementAttempts using Redis Lua for compatibility paths
TTL-based expiration
resend cooldown metadata based on Redis server time
malformed state detection
integration-style Redis tests

Stored fields:

request_id
tenant_id
phone
code_hash
attempt_count
max_attempts
created_at
expires_at
resend_available_at_ms

Important behavior:

First send creates OTP state only if no active state exists.
Eligible resend replaces OTP state only if the observed request_id is still current.
Verify mutations only affect the state if request_id still matches.
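
The conditional semantics above can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: the real store runs these checks inside Redis Lua scripts and applies TTLs, both of which are omitted here, and the MemoryStore type and reduced State struct are hypothetical. Only the method names come from the list above.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a reduced, hypothetical version of the stored OTP state.
type State struct {
	RequestID    string
	CodeHash     string
	AttemptCount int
}

// MemoryStore mimics the request-ID-conditional operations with a mutex
// in place of Redis Lua atomicity. TTL handling is left out.
type MemoryStore struct {
	mu     sync.Mutex
	states map[string]State
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{states: make(map[string]State)}
}

// CreateIfAbsent stores the state only when no state exists for the key.
func (s *MemoryStore) CreateIfAbsent(key string, st State) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.states[key]; exists {
		return false
	}
	s.states[key] = st
	return true
}

// ReplaceIfRequestID swaps the state only when the observed request ID
// is still the current one.
func (s *MemoryStore) ReplaceIfRequestID(key, observedID string, next State) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.states[key]
	if !exists || cur.RequestID != observedID {
		return false
	}
	s.states[key] = next
	return true
}

// IncrementAttemptsIfRequestID bumps the attempt counter only for the
// matching request ID and returns the new count.
func (s *MemoryStore) IncrementAttemptsIfRequestID(key, requestID string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.states[key]
	if !exists || cur.RequestID != requestID {
		return 0, false
	}
	cur.AttemptCount++
	s.states[key] = cur
	return cur.AttemptCount, true
}

// DeleteIfRequestID removes the state only for the matching request ID.
func (s *MemoryStore) DeleteIfRequestID(key, requestID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, exists := s.states[key]
	if !exists || cur.RequestID != requestID {
		return false
	}
	delete(s.states, key)
	return true
}

func main() {
	store := NewMemoryStore()
	key := "otp:tenant-a:+15550100" // hypothetical tenant and phone
	fmt.Println("first send:", store.CreateIfAbsent(key, State{RequestID: "req-1"}))
	fmt.Println("resend:", store.ReplaceIfRequestID(key, "req-1", State{RequestID: "req-2"}))
	fmt.Println("stale delete:", store.DeleteIfRequestID(key, "req-1"))
}
```

The stale delete returns false because a resend has already replaced req-1 with req-2, which is the property that keeps an old verify path from touching a newer OTP.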

Tenant Settings Cache Provider

Implemented:

Redis cache-aside provider for tenant settings
PostgreSQL fallback
Redis key format: tenant:{tenant_id}:settings
stores only OTP-domain tenant settings subset
avoids caching sensitive/unneeded DB fields
falls back on malformed cache
source errors returned when PostgreSQL lookup fails
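
A cache-aside lookup with these properties might be sketched as follows. The TenantSettings fields, the Cache and Source interfaces, and the JSON encoding are assumptions made for illustration; cache TTL and context handling are omitted.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// TenantSettings holds a hypothetical OTP-domain subset of tenant settings.
type TenantSettings struct {
	TenantID string `json:"tenant_id"`
	Enabled  bool   `json:"enabled"`
}

// Cache and Source are hypothetical ports standing in for Redis and PostgreSQL.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

type Source interface {
	Load(tenantID string) (TenantSettings, error)
}

type CachedProvider struct {
	cache  Cache
	source Source
}

// Get returns cached settings when they decode cleanly; otherwise it falls
// back to the source, returns source errors unchanged, and refills the cache.
func (p *CachedProvider) Get(tenantID string) (TenantSettings, error) {
	key := "tenant:" + tenantID + ":settings"
	if raw, ok := p.cache.Get(key); ok {
		var cached TenantSettings
		if err := json.Unmarshal([]byte(raw), &cached); err == nil && cached.TenantID == tenantID {
			return cached, nil
		}
		// Malformed cache entry: ignore it and fall through to the source.
	}
	loaded, err := p.source.Load(tenantID)
	if err != nil {
		return TenantSettings{}, err
	}
	if raw, err := json.Marshal(loaded); err == nil {
		p.cache.Set(key, string(raw))
	}
	return loaded, nil
}

// mapCache and sourceFunc are small in-memory fakes for the demo.
type mapCache map[string]string

func (c mapCache) Get(key string) (string, bool) { v, ok := c[key]; return v, ok }
func (c mapCache) Set(key, value string)         { c[key] = value }

type sourceFunc func(tenantID string) (TenantSettings, error)

func (f sourceFunc) Load(tenantID string) (TenantSettings, error) { return f(tenantID) }

var errSourceDown = errors.New("source unavailable")

func main() {
	cache := mapCache{"tenant:t1:settings": "{not json"}
	p := &CachedProvider{cache: cache, source: sourceFunc(func(id string) (TenantSettings, error) {
		return TenantSettings{TenantID: id, Enabled: true}, nil
	})}
	s, err := p.Get("t1")
	fmt.Println(s.Enabled, err)
	fmt.Println(cache["tenant:t1:settings"])
}
```

Serializing a dedicated struct rather than the full database row is one way to keep sensitive or unneeded columns out of the cache.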

Fake SMS Provider

Implemented:

fake SMS provider implementing OTP SMS provider interface
configurable latency
default delay: 20ms to 30ms
context cancellation/timeout support
safe SMS result
no OTP code in RawResponse
dev-only Redis debug code capture

Dev-Only Fake SMS OTP Capture

Implemented for local/manual testing only.

Behavior:

disabled by default
enabled through config/env
only active outside release mode
stores plaintext OTP in a separate Redis debug key
does not expose code in API response
does not write code into otp_requests
does not write code into normal OTP state
does not log the code

Debug key format:

debug:otp-code:{tenant_id}:{phone}

SendOTP Service

Implemented behavior:

validate request
normalize phone
load tenant settings
validate tenant
load existing OTP state
block resend if active state exists and resend cooldown has not elapsed
check optional send rate limiter
generate request ID
generate OTP code
hash OTP code
create OTP request log
create OTP state atomically with CreateIfAbsent for first send
replace OTP state atomically with ReplaceIfRequestID for eligible resend
send SMS through provider only after successful Redis reservation/replacement
update provider result log
return request ID and expiration

Important details:

phone normalization happens before Redis keys, logs, limiter identity, and SMS send
tenant validation happens before Redis OTP state check
resend cooldown check happens before send rate limiting
eligible resend still passes through send rate limiting
blocked cooldown resend does not create request log
rate-limited send does not create request log
SMS is sent only after atomic OTP state reservation or replacement succeeds
SMS provider failure is mapped to domain provider failure
request logging is mandatory for send lifecycle
request ID changes on every successful resend
successful resend invalidates the previous OTP code
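
The ordering rules above can be made concrete with a toy trace. This is only a sketch: the Steps struct and SendOTP signature are hypothetical, each dependency is reduced to a precomputed result, and the returned reason strings are borrowed from the send outcome metric reasons listed later in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// Steps holds precomputed results for each dependency (hypothetical).
type Steps struct {
	CooldownActive bool
	LimiterAllows  bool
	ReservationOK  bool // result of CreateIfAbsent or ReplaceIfRequestID
	SMSErr         error
}

var errProviderDown = errors.New("provider down")

// SendOTP returns the side-effecting steps that ran, in order, and the
// terminal reason.
func SendOTP(s Steps) (trace []string, reason string) {
	// The cooldown check runs first and touches neither limiter nor logs.
	if s.CooldownActive {
		return trace, "resend_cooldown"
	}
	trace = append(trace, "limiter")
	if !s.LimiterAllows {
		// Rate-limited sends do not create a request log.
		return trace, "rate_limited"
	}
	// The request log is created before the Redis write and the SMS send.
	trace = append(trace, "request_log", "reserve")
	if !s.ReservationOK {
		return trace, "reservation_collision"
	}
	// SMS is sent only after the reservation or replacement succeeded.
	trace = append(trace, "sms", "provider_result")
	if s.SMSErr != nil {
		return trace, "sms_provider_error"
	}
	return trace, "sms_sent"
}

func main() {
	fmt.Println(SendOTP(Steps{CooldownActive: true}))
	fmt.Println(SendOTP(Steps{LimiterAllows: true, ReservationOK: true}))
	fmt.Println(SendOTP(Steps{LimiterAllows: true, ReservationOK: true, SMSErr: errProviderDown}))
}
```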

VerifyOTP Service

Implemented behavior:

validate request
normalize phone
load OTP state from Redis
handle not found
handle expiration
handle max attempts already reached
verify code hash
increment attempts only for wrong code using request-ID-conditional mutation
handle invalid code
handle max attempts after increment
delete OTP state on success using request-ID-conditional mutation
return verified response

Important details:

correct code does not increment failed attempts
successful verification is one-time-use
delete failure after correct code returns error
expired/max-attempt cleanup delete is best-effort
verification logging is best-effort
stale verify paths cannot delete or mutate a newer OTP state created by resend
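
The verify decision order can be sketched as a pure function. The State fields, the Step type, and the mutation labels are hypothetical. Two details are assumptions rather than documented behavior: an OTP counts as expired once the current time reaches expires_at, and a wrong code that uses up the last attempt reports max_attempts_exceeded.

```go
package main

import "fmt"

// State is a reduced, hypothetical view of the stored OTP state.
type State struct {
	RequestID    string
	AttemptCount int
	MaxAttempts  int
	ExpiresAtMs  int64
}

// Step describes the outcome and which store mutation should follow.
type Step struct {
	Verified bool
	Reason   string
	Mutation string
}

// Decide applies the checks in the documented order. codeMatches is the
// result of the constant-time hash comparison.
func Decide(st *State, codeMatches bool, nowMs int64) Step {
	switch {
	case st == nil:
		return Step{Reason: "not_found"}
	case nowMs >= st.ExpiresAtMs:
		return Step{Reason: "expired", Mutation: "best_effort_delete"}
	case st.AttemptCount >= st.MaxAttempts:
		return Step{Reason: "max_attempts_exceeded", Mutation: "best_effort_delete"}
	case codeMatches:
		// A correct code never increments the failed-attempt counter.
		return Step{Verified: true, Reason: "verified", Mutation: "delete_if_request_id"}
	case st.AttemptCount+1 >= st.MaxAttempts:
		return Step{Reason: "max_attempts_exceeded", Mutation: "increment_attempts_if_request_id"}
	default:
		return Step{Reason: "invalid_code", Mutation: "increment_attempts_if_request_id"}
	}
}

func main() {
	st := &State{RequestID: "req-1", MaxAttempts: 3, ExpiresAtMs: 120000}
	fmt.Println(Decide(st, false, 1000))
	fmt.Println(Decide(st, true, 1000))
	fmt.Println(Decide(nil, true, 1000))
}
```

Because both follow-up mutations are conditional on the request ID, a verify call that loaded req-1 cannot affect a state that a concurrent resend has already replaced.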

Request Logging

Implemented PostgreSQL request logging using table otp_requests.

Behavior:

create request log before Redis OTP save and SMS send
update provider result after SMS success/failure
request logging is mandatory in SendOTP
provider response is safe and does not include OTP code

Verification Logging

Implemented PostgreSQL verification logging using table otp_verifications.

Logged outcomes:

success / verified
failed / not_found
failed / expired
failed / invalid_code
failed / max_attempts_exceeded

Important behavior:

logging is best-effort
logging failure does not change VerifyOTP response
invalid request validation failures are not logged
infrastructure failures are not logged
success is logged only after Redis delete succeeds

HTTP API

Implemented endpoints:

POST /v1/otp/send
POST /v1/otp/verify

HTTP layer responsibilities:

bind JSON
validate required fields
call service
map domain errors to HTTP response
keep business logic out of handlers

Error mappings include:

invalid request -> 400
tenant disabled -> 403
tenant not found -> 404
OTP already active -> 429
OTP send rate limit exceeded -> 429
SMS provider failed -> 502
generic/internal errors -> 500

Verify business failures return 200 OK with:

{
  "verified": false,
  "reason": "..."
}
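
A handler-side mapping along these lines could be written with errors.Is. The sketch below is illustrative: ErrOTPAlreadyActive and ErrOTPRateLimited are named in this document, while the other sentinel names, StatusFor, VerifyResponse, and EncodeVerifyFailure are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// Sentinel domain errors. Names other than ErrOTPAlreadyActive and
// ErrOTPRateLimited are assumptions.
var (
	ErrInvalidRequest    = errors.New("invalid request")
	ErrTenantDisabled    = errors.New("tenant disabled")
	ErrTenantNotFound    = errors.New("tenant not found")
	ErrOTPAlreadyActive  = errors.New("OTP already active")
	ErrOTPRateLimited    = errors.New("OTP send rate limit exceeded")
	ErrSMSProviderFailed = errors.New("SMS provider failed")
)

// StatusFor maps a domain error to an HTTP status. errors.Is lets wrapped
// errors map the same way as the bare sentinels.
func StatusFor(err error) int {
	switch {
	case errors.Is(err, ErrInvalidRequest):
		return http.StatusBadRequest
	case errors.Is(err, ErrTenantDisabled):
		return http.StatusForbidden
	case errors.Is(err, ErrTenantNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrOTPAlreadyActive), errors.Is(err, ErrOTPRateLimited):
		return http.StatusTooManyRequests
	case errors.Is(err, ErrSMSProviderFailed):
		return http.StatusBadGateway
	default:
		return http.StatusInternalServerError
	}
}

// VerifyResponse mirrors the JSON shape shown above.
type VerifyResponse struct {
	Verified bool   `json:"verified"`
	Reason   string `json:"reason"`
}

// EncodeVerifyFailure renders a business failure body for a 200 OK response.
func EncodeVerifyFailure(reason string) string {
	// Marshal cannot fail for this fixed struct, so the error is ignored.
	body, _ := json.Marshal(VerifyResponse{Verified: false, Reason: reason})
	return string(body)
}

func main() {
	wrapped := fmt.Errorf("send otp: %w", ErrTenantDisabled)
	fmt.Println(StatusFor(wrapped))
	fmt.Println(StatusFor(errors.New("database unavailable")))
	fmt.Println(EncodeVerifyFailure("expired"))
}
```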

Resend Protection and Cooldown

Implemented behavior:

If an active, unexpired OTP exists and resend cooldown has not elapsed,
a new SendOTP request is rejected.

This happens before the send rate limiter.

Response:

429 Too Many Requests
Retry-After: <remaining_seconds>

Message:

OTP already active

Current semantics:

OTP_TTL controls how long an OTP can be verified.
OTP_RESEND_COOLDOWN controls how soon a new OTP can be sent.
after cooldown elapses, resend is allowed even if the previous OTP has not expired.
successful resend replaces the previous Redis OTP state.
the previous OTP code becomes invalid.
replacement is atomic and protected by request ID.
cooldown retry timing is represented with OTPResendCooldownError.
OTPResendCooldownError still unwraps to ErrOTPAlreadyActive for compatibility.
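
One way to get an error that carries retry timing and still matches ErrOTPAlreadyActive is an Unwrap method. In this sketch the RetryAfter field, CheckCooldown, RetryAfterSeconds, and IsAlreadyActive are hypothetical, and rounding the header value up to whole seconds is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrOTPAlreadyActive = errors.New("OTP already active")

// OTPResendCooldownError carries the remaining cooldown. The field name
// is an assumption.
type OTPResendCooldownError struct {
	RetryAfter time.Duration
}

func (e *OTPResendCooldownError) Error() string {
	return fmt.Sprintf("%v: retry after %s", ErrOTPAlreadyActive, e.RetryAfter)
}

// Unwrap keeps errors.Is(err, ErrOTPAlreadyActive) true for older callers.
func (e *OTPResendCooldownError) Unwrap() error { return ErrOTPAlreadyActive }

// CheckCooldown compares the current time with resend_available_at_ms.
func CheckCooldown(nowMs, resendAvailableAtMs int64) error {
	if nowMs >= resendAvailableAtMs {
		return nil
	}
	remaining := time.Duration(resendAvailableAtMs-nowMs) * time.Millisecond
	return &OTPResendCooldownError{RetryAfter: remaining}
}

// RetryAfterSeconds extracts a Retry-After value, rounded up so the
// client never retries early.
func RetryAfterSeconds(err error) (int, bool) {
	var cooldown *OTPResendCooldownError
	if !errors.As(err, &cooldown) {
		return 0, false
	}
	secs := int((cooldown.RetryAfter + time.Second - 1) / time.Second)
	if secs < 1 {
		secs = 1
	}
	return secs, true
}

// IsAlreadyActive shows the compatibility path through Unwrap.
func IsAlreadyActive(err error) bool { return errors.Is(err, ErrOTPAlreadyActive) }

func main() {
	err := CheckCooldown(1000, 2500)
	secs, ok := RetryAfterSeconds(err)
	fmt.Println(err)
	fmt.Println("already active:", IsAlreadyActive(err))
	fmt.Println("Retry-After:", secs, ok)
}
```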

Send Rate Limiting

Implemented OTP send rate limiting with multiple Redis-backed strategies.

Supported dimensions:

tenant
phone
tenant + phone

Supported strategies:

fixed_window
token_bucket

Supported production-relevant mixed strategy:

tenant = token_bucket
phone  = fixed_window

Current Redis key formats include:

otp:rate:send:fixed_window:tenant:{tenant_id}
otp:rate:send:fixed_window:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:tenant:{tenant_id}
otp:rate:send:token_bucket:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:{tenant:<tenant_id>}:tenant
otp:rate:send:fixed_window:{tenant:<tenant_id>}:phone:<phone>

Implementation details:

Redis-backed adapters implement otp.SendRateLimiter
fixed-window limiter uses Redis Lua for atomic counter + TTL handling
token-bucket limiter uses Redis Lua for refill + consume + TTL handling
mixed limiter uses a single Redis Lua decision for tenant token bucket + phone fixed window
mixed limiter evaluates both dimensions before consuming either quota
phone rejection does not consume tenant token
tenant rejection does not increment phone counter
both-dimension rejection returns the larger retry duration
mixed limiter keys use a shared Redis Cluster hash tag per tenant
missing phone fixed-window TTL is repaired in Lua
malformed token bucket state returns infrastructure error rather than allowing traffic
limit exceeded maps to ErrOTPRateLimited
dimension-aware mixed limiter errors carry phone, tenant, both, or unknown
rate limiter is optional and env-controlled
disabled by default

Rate limiting runs after resend cooldown / active OTP protection.
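
The "evaluate both dimensions before consuming either quota" rule of the mixed limiter can be modeled in memory. This is a sketch, not the Lua script: the types and field names are hypothetical, the bucket is assumed to refill one token per fixed interval, and Redis TTL handling is omitted.

```go
package main

import "fmt"

// TenantBucket is a simplified token bucket: one token per RefillIntervalMs.
type TenantBucket struct {
	Tokens, Capacity, RefillIntervalMs, LastMs int64
}

// PhoneWindow is a fixed-window counter.
type PhoneWindow struct {
	Count, Max         int
	WindowMs, StartMs  int64
}

type Decision struct {
	Allowed      bool
	Dimension    string // "phone", "tenant", or "both" when rejected
	RetryAfterMs int64
}

func (b *TenantBucket) refill(nowMs int64) {
	if nowMs <= b.LastMs {
		return
	}
	added := (nowMs - b.LastMs) / b.RefillIntervalMs
	if added == 0 {
		return
	}
	b.Tokens += added
	b.LastMs += added * b.RefillIntervalMs
	if b.Tokens >= b.Capacity {
		b.Tokens = b.Capacity
		b.LastMs = nowMs
	}
}

// Decide checks both dimensions first and consumes quota only when both pass.
func Decide(b *TenantBucket, w *PhoneWindow, nowMs int64) Decision {
	b.refill(nowMs)
	if nowMs-w.StartMs >= w.WindowMs {
		w.StartMs, w.Count = nowMs, 0 // start a new window
	}
	tenantOK := b.Tokens >= 1
	phoneOK := w.Count < w.Max
	tenantRetry := b.LastMs + b.RefillIntervalMs - nowMs
	phoneRetry := w.StartMs + w.WindowMs - nowMs

	switch {
	case tenantOK && phoneOK:
		b.Tokens--
		w.Count++
		return Decision{Allowed: true}
	case !tenantOK && !phoneOK:
		retry := tenantRetry
		if phoneRetry > retry {
			retry = phoneRetry // report the larger wait
		}
		return Decision{Dimension: "both", RetryAfterMs: retry}
	case !phoneOK:
		return Decision{Dimension: "phone", RetryAfterMs: phoneRetry}
	default:
		return Decision{Dimension: "tenant", RetryAfterMs: tenantRetry}
	}
}

func main() {
	bucket := &TenantBucket{Tokens: 2, Capacity: 2, RefillIntervalMs: 1000}
	window := &PhoneWindow{Max: 1, WindowMs: 60000}
	fmt.Println(Decide(bucket, window, 0))
	fmt.Println(Decide(bucket, window, 0), "tenant tokens left:", bucket.Tokens)
}
```

In the second call the phone window rejects the send while the tenant bucket still holds its remaining token, matching "phone rejection does not consume tenant token".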

Current Configuration / Env Support

Configured values include:

OTP_CODE_LENGTH
OTP_TTL
OTP_RESEND_COOLDOWN
OTP_MAX_ATTEMPTS
OTP_TENANT_CACHE_TTL
OTP_PROVIDER_TIMEOUT

OTP_FAKE_SMS_MIN_DELAY
OTP_FAKE_SMS_MAX_DELAY
OTP_FAKE_SMS_DEBUG_CODE_REDIS
OTP_FAKE_SMS_DEBUG_CODE_TTL

OTP_SEND_RATE_LIMIT_ENABLED
OTP_SEND_RATE_LIMIT_MAX
OTP_SEND_RATE_LIMIT_WINDOW
OTP_SEND_RATE_LIMIT_STRATEGY

OTP_SEND_RATE_LIMIT_PHONE_ENABLED
OTP_SEND_RATE_LIMIT_PHONE_STRATEGY
OTP_SEND_RATE_LIMIT_PHONE_MAX
OTP_SEND_RATE_LIMIT_PHONE_WINDOW

OTP_SEND_RATE_LIMIT_TENANT_ENABLED
OTP_SEND_RATE_LIMIT_TENANT_STRATEGY
OTP_SEND_RATE_LIMIT_TENANT_MAX
OTP_SEND_RATE_LIMIT_TENANT_WINDOW

Current important defaults:

OTP_CODE_LENGTH=6
OTP_TTL=2m
OTP_RESEND_COOLDOWN defaults to OTP_TTL when unset
OTP_MAX_ATTEMPTS=3
OTP_TENANT_CACHE_TTL=5m
OTP_PROVIDER_TIMEOUT=2s

OTP_FAKE_SMS_MIN_DELAY=20ms
OTP_FAKE_SMS_MAX_DELAY=30ms
OTP_FAKE_SMS_DEBUG_CODE_REDIS=false
OTP_FAKE_SMS_DEBUG_CODE_TTL=60s

OTP_SEND_RATE_LIMIT_ENABLED=false
OTP_SEND_RATE_LIMIT_MAX=5
OTP_SEND_RATE_LIMIT_WINDOW=10m
OTP_SEND_RATE_LIMIT_STRATEGY=fixed_window

Validation rules:

OTP_RESEND_COOLDOWN > 0
OTP_RESEND_COOLDOWN <= OTP_TTL
enabled rate-limit dimensions require positive max/window values
unsupported strategies are rejected
unsupported mixed strategies are rejected unless implemented atomically
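
These validation rules might translate into a check like the one below. The Config and DimensionLimit types are hypothetical, and the sketch assumes that two enabled dimensions using the same strategy are acceptable, which this document does not state explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// DimensionLimit and Config are hypothetical shapes for the env-driven values.
type DimensionLimit struct {
	Enabled  bool
	Strategy string
	Max      int
	Window   time.Duration
}

type Config struct {
	TTL            time.Duration
	ResendCooldown time.Duration
	Tenant, Phone  DimensionLimit
}

// DefaultConfig mirrors the documented defaults: OTP_TTL=2m, with the
// resend cooldown falling back to the TTL and rate limiting disabled.
func DefaultConfig() Config {
	return Config{TTL: 2 * time.Minute, ResendCooldown: 2 * time.Minute}
}

func (d DimensionLimit) validate(name string) error {
	if !d.Enabled {
		return nil
	}
	if d.Max <= 0 || d.Window <= 0 {
		return fmt.Errorf("%s rate limit: max and window must be positive", name)
	}
	if d.Strategy != "fixed_window" && d.Strategy != "token_bucket" {
		return fmt.Errorf("%s rate limit: unsupported strategy %q", name, d.Strategy)
	}
	return nil
}

func (c Config) Validate() error {
	if c.ResendCooldown <= 0 {
		return errors.New("OTP_RESEND_COOLDOWN must be > 0")
	}
	if c.ResendCooldown > c.TTL {
		return errors.New("OTP_RESEND_COOLDOWN must be <= OTP_TTL")
	}
	if err := c.Tenant.validate("tenant"); err != nil {
		return err
	}
	if err := c.Phone.validate("phone"); err != nil {
		return err
	}
	// Only one mixed combination is documented as atomically implemented.
	mixed := c.Tenant.Enabled && c.Phone.Enabled && c.Tenant.Strategy != c.Phone.Strategy
	if mixed && !(c.Tenant.Strategy == "token_bucket" && c.Phone.Strategy == "fixed_window") {
		return errors.New("unsupported mixed rate-limit strategy")
	}
	return nil
}

func main() {
	cfg := DefaultConfig()
	fmt.Println(cfg.Validate())
	cfg.ResendCooldown = cfg.TTL + time.Second
	fmt.Println(cfg.Validate())
}
```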

Operational note:

When a runtime/env parameter is added, update env.example and .env together when the project runs from .env.

Current Redis Usage

Redis is used for:

OTP state
OTP attempt counter
tenant settings cache
send rate limiter
dev-only fake SMS OTP debug capture

Main key patterns:

otp:{tenant_id}:{phone}
tenant:{tenant_id}:settings
otp:rate:send:{tenant_id}:{phone}
otp:rate:send:fixed_window:tenant:{tenant_id}
otp:rate:send:fixed_window:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:tenant:{tenant_id}
otp:rate:send:token_bucket:phone:{tenant_id}:{phone}
otp:rate:send:token_bucket:{tenant:<tenant_id>}:tenant
otp:rate:send:fixed_window:{tenant:<tenant_id>}:phone:<phone>
debug:otp-code:{tenant_id}:{phone}
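
In the patterns above, {tenant_id} and {phone} are placeholders, while the braces in the two mixed-limiter keys are literal Redis Cluster hash tags. Redis Cluster hashes only the text between the first { and the following }, so both mixed keys for a tenant land in the same slot, which a multi-key Lua script needs. The helpers below sketch this; all function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Key builders for a few of the documented patterns (hypothetical names).
func OTPStateKey(tenantID, phone string) string {
	return "otp:" + tenantID + ":" + phone
}

func TenantSettingsKey(tenantID string) string {
	return "tenant:" + tenantID + ":settings"
}

func DebugCodeKey(tenantID, phone string) string {
	return "debug:otp-code:" + tenantID + ":" + phone
}

// The mixed limiter keys embed a literal {tenant:<id>} hash tag.
func MixedTenantBucketKey(tenantID string) string {
	return "otp:rate:send:token_bucket:{tenant:" + tenantID + "}:tenant"
}

func MixedPhoneWindowKey(tenantID, phone string) string {
	return "otp:rate:send:fixed_window:{tenant:" + tenantID + "}:phone:" + phone
}

// HashTag returns the part of the key that Redis Cluster hashes: the text
// between the first '{' and the next '}', or the whole key when there is
// no such non-empty section.
func HashTag(key string) string {
	open := strings.IndexByte(key, '{')
	if open < 0 {
		return key
	}
	end := strings.IndexByte(key[open+1:], '}')
	if end <= 0 {
		return key
	}
	return key[open+1 : open+1+end]
}

func main() {
	tenant, phone := "t1", "+15550100" // hypothetical identifiers
	bucket := MixedTenantBucketKey(tenant)
	window := MixedPhoneWindowKey(tenant, phone)
	fmt.Println(bucket)
	fmt.Println(window)
	fmt.Println("same hash tag:", HashTag(bucket) == HashTag(window))
}
```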

Current PostgreSQL Usage

PostgreSQL is used for:

tenant settings
OTP request logs
OTP verification logs

Tables involved:

tenant_settings
otp_requests
otp_verifications

Current Testing Status

Implemented test coverage includes:

OTP generation tests
hash/verify tests
service SendOTP tests
service VerifyOTP tests
handler tests
Redis OTP store tests
Redis send rate limiter tests
tenant cache provider tests
request log repository tests
verification log repository tests
fake SMS provider tests
config/env tests

Common verification commands:

go test -count=1 ./internal/otp -v
go test -count=1 ./internal/api -v
go test -count=1 ./internal/repository -v
go test -count=1 ./internal/config -v
go test -count=1 ./internal/sms -v
go test -count=1 ./...

Current Manual Runtime Validation

Manual validation has been performed for:

/v1/otp/send
/v1/otp/verify
active OTP resend protection
resend cooldown block
resend after cooldown before OTP expiry
invalidation of old OTP after resend
verify with new OTP after resend
one-time-use OTP behavior
Redis OTP state deletion after successful verification
Redis resend_available_at_ms state
concurrent resend behavior
resend cooldown Retry-After
dev-only debug code capture
verification logging
request logging
fixed-window send rate limiting
send rate-limit Retry-After
mixed limiter startup
phone fixed-window block without tenant token consumption
tenant token-bucket block without phone counter consumption
mixed limiter Redis key layout and TTLs
concurrent tenant token-bucket burst behavior
phone fixed-window enforcement inside mixed limiter
rate_limited_phone metric reason
rate_limited_tenant metric reason

Important Implementation Decisions

Keep HTTP handlers thin.
Keep OTP orchestration inside internal/otp.Service.
Use interfaces for external dependencies.
Keep Redis/PostgreSQL adapters in internal/repository.
Keep fake SMS simulation in internal/sms.
Do not expose OTP code in API responses.
Do not store plaintext OTP in normal Redis OTP state.
Use best-effort verification logging.
Keep request logging mandatory for send lifecycle.
Keep rate limiting optional and disabled by default.
Keep changes incremental and small.
Commit after each stable phase.

Current Known Limitations

No real SMS provider yet.
No provider router/registry yet.
OTP send outcome metrics exist, but full verify/provider/tracing observability is not complete yet.
No circuit breaker yet.
No retry policy yet.
Send rate limiting supports tenant and phone dimensions, but no per-IP limiting exists yet.
No global per-phone quota yet.
Retry-After exists for resend cooldown and send rate limiting, but structured client-facing retry metadata is not yet part of the JSON schema.
OTP phone input is normalized, but Redis key phone components are not hashed.
No OpenAPI documentation.
No auth/token validation for OTP endpoints yet.
Atomic first-send reservation is implemented, but limiter quota and OTP reservation are still separate atomic decisions.
OTP send idempotency is not implemented yet.
SMS provider unknown-delivery semantics are not finalized yet.
Transactional outbox / async SMS delivery is not implemented yet.

OTP Rate Limiting and Resend Flow Status Update

Latest Status Summary

After the latest slices, OTP rate limiting and resend behavior stand as follows:

OTP_TTL controls verification validity.
OTP_RESEND_COOLDOWN controls resend eligibility.
Redis/Lua protects first-send creation.
Redis/Lua protects resend replacement.
Verify mutations are request-ID conditional.
Send rate limiting supports fixed_window, token_bucket, and one mixed strategy.
Metrics distinguish cooldown, phone rate limit, tenant rate limit, and both-dimension rate limit.

Current SendOTP Flow

The current SendOTP flow, in summary:

validate request
-> normalize phone
-> load tenant settings
-> validate tenant
-> load existing OTP state
-> block if resend cooldown is active
-> run send limiter
-> generate request_id and code
-> create request log
-> CreateIfAbsent or ReplaceIfRequestID
-> send SMS
-> update provider result
-> observe terminal outcome

Current VerifyOTP Flow

The current VerifyOTP flow, in summary:

validate request
-> normalize phone
-> load OTP state
-> check expiration/max attempts
-> verify code
-> IncrementAttemptsIfRequestID for wrong code
-> DeleteIfRequestID for successful verification
-> log verification outcome

Current Retry-After Behavior

There are two kinds of Retry-After:

resend cooldown -> OTP already active
send rate limit -> OTP send rate limit exceeded

Both return 429, but they have different domain errors and metric reasons.

OTP Send Outcome Metrics

Main metric:

otp_send_outcomes_total{result,reason}

Important reasons:

sms_sent
resend_cooldown
rate_limited
rate_limited_phone
rate_limited_tenant
rate_limited_both
reservation_collision
limiter_error
state_create_error
sms_provider_error

Suggested Next Step

Suggested next step:

OTP send idempotency

Rationale:

Client-side request retries are not yet deterministic.
If a client times out and sends again, there is no idempotency key yet.
Idempotency is the foundation needed to design SMS provider retry and a transactional outbox.

Availability / HA Lab Status

Availability Roadmap Status

Current availability roadmap status:

Phase 1: Single Traefik Gateway Baseline
Status: Done

Topology:
Client -> Traefik -> Backend Service

Phase 2: Backend HA behind Single Traefik
Status: Done

Topology:
Client -> Single Traefik -> Backend-1 / Backend-2

Phase 3: Traefik HA + Keepalived + VIP
Status: Done

Topology:
Client -> VIP -> Traefik-1 / Traefik-2 -> Backend-1 / Backend-2

Phase 4: Nginx version
Status: Deferred / Future

Phase 5: Redis/Postgres HA
Status: Deferred / Future

Current Availability Topology

Current validated HA topology:

Client
  -> VIP (192.168.56.100)
  -> Traefik-1 / Traefik-2
  -> backend-1 / backend-2
  -> shared PostgreSQL / Redis / Mongo

Current Gateway HA Lab

The current gateway HA lab uses:

- VirtualBox
- Ubuntu Server 24.04 VMs
- Keepalived
- VRRP
- Virtual IP failover
- Traefik v3.7.1

Current VM roles:

proxy-1
  IP: 192.168.56.11
  Role: MASTER
  Priority: 110

proxy-2
  IP: 192.168.56.12
  Role: BACKUP
  Priority: 100

VIP
  192.168.56.100

Current Backend HA Lab

The backend availability lab currently supports:

Client
  -> Single Traefik
  -> backend-1 / backend-2

Current backend ports:

backend-1 direct:
  localhost:8080

backend-2 direct:
  localhost:8083

Traefik gateway:
  localhost:8081

Traefik dashboard:
  localhost:8082/dashboard/

Important implementation details:

- backend-1 and backend-2 share the same PostgreSQL/Redis/Mongo
- backend HA reuses the main dependency stack
- the original single API container is intentionally excluded during HA lab execution
- Traefik uses file-provider based upstream configuration
- health checks currently use /health

Current Validated HA Behaviors

The following runtime behaviors have been manually validated:

Backend Availability

Validated:

- backend-1 failure
- backend-2 failure
- backend recovery
- Traefik upstream recovery
- backend load balancing

Gateway Availability

Validated:

- Traefik failover
- VIP migration
- automatic failback
- reboot recovery
- simultaneous reboot recovery

Keepalived / VRRP

Validated:

- MASTER/BACKUP election
- VRRP advertisements
- VIP ownership transfer
- service-based failover using Traefik health script

Current HA Runtime Notes

Current important runtime behavior:

Traefik service health controls VIP ownership.

If Traefik fails on the MASTER node:

- Keepalived removes VIP ownership
- BACKUP node becomes MASTER
- VIP migrates automatically
- traffic continues through the surviving node

When the higher-priority node recovers:

VIP automatically fails back to the preferred MASTER node.
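
The failover and failback behavior described here reduces to a simple rule: the VIP belongs to the highest-priority node whose Traefik is healthy. The sketch below models only that rule, using the lab's node names and priorities; it is not Keepalived configuration, and the Node type and VIPOwner function are hypothetical.

```go
package main

import "fmt"

// Node is a simplified view of a gateway VM.
type Node struct {
	Name           string
	Priority       int
	TraefikHealthy bool
}

// VIPOwner returns the healthy node with the highest priority, or false
// when no node is eligible to hold the VIP.
func VIPOwner(nodes []Node) (string, bool) {
	best := -1
	for i, n := range nodes {
		if !n.TraefikHealthy {
			continue // an unhealthy Traefik gives up VIP ownership
		}
		if best == -1 || n.Priority > nodes[best].Priority {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return nodes[best].Name, true
}

func main() {
	nodes := []Node{
		{Name: "proxy-1", Priority: 110, TraefikHealthy: true},
		{Name: "proxy-2", Priority: 100, TraefikHealthy: true},
	}
	owner, _ := VIPOwner(nodes)
	fmt.Println("normal:", owner)

	nodes[0].TraefikHealthy = false // Traefik fails on the MASTER
	owner, _ = VIPOwner(nodes)
	fmt.Println("failover:", owner)

	nodes[0].TraefikHealthy = true // higher-priority node recovers
	owner, _ = VIPOwner(nodes)
	fmt.Println("failback:", owner)
}
```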

Current HA-Related Documentation

Detailed documentation currently exists for:

deploy/availability-lab/traefik-baseline/
deploy/availability-lab/traefik-backend-ha/
deploy/availability-lab/traefik-vm-keepalived/
docs/phase3-traefik-keepalived-ha-lab.md

The full step-by-step VirtualBox + Keepalived + Traefik HA setup, validation workflow, failover tests, reboot validation, and troubleshooting are documented in:

docs/phase3-traefik-keepalived-ha-lab.md

The manually validated Phase 3 VM configuration templates are version-controlled in:

deploy/availability-lab/traefik-vm-keepalived/

Current HA Deferred Concerns

The following concerns are intentionally deferred:

- TLS / HTTPS
- Kubernetes ingress/controller deployment
- cloud load balancer integration
- production firewall hardening
- production dashboard security
- Redis HA
- PostgreSQL HA
- Mongo HA
- distributed tracing for failover timing
- metrics for HA transitions
- Nginx HA implementation
- multi-datacenter routing
- BGP/advanced routing
