fix: stop verification 4xx HTTP errors from triggering Sev1 exceptions alert#505
fix: stop verification 4xx HTTP errors from triggering Sev1 exceptions alert#505Copilot wants to merge 5 commits into
Conversation
…b 4xx errors - ci_status.py and devops_analysis.py: only call span.record_exception for non-HTTP errors (GitHubServerError, network, timeout). HTTP 4xx client errors (private repo 403, missing token 401, etc.) are handled gracefully and should not appear in the Application Insights exceptions table that drives the Sev1 alert. The github_error_to_result mapper already records span.add_event for these cases. - function_app.py: wrap the _get_session_maker() call in start_verification_job with a try/except so a settings load failure or transient DB error returns a clean 500 response instead of an unhandled exception propagating to the Azure Functions runtime and being recorded in Application Insights.
|
@copilot this seems to hide the error instead of actually fix it? |
The errors are not hidden. Here is exactly what happens with each error type: 4xx HTTP errors (e.g. 403 private repo, 401 auth issue)
Network errors, timeouts, GitHubServerError (5xx after retries)
The previous code called The |
|
The error is firing because something is wrong causing an error message. I believe it's due to change in setting names and the verification function not finding the correct setting due to name change. Fix that. @copilot |
Investigated thoroughly and found the real cause: Fixed in commit |
Summary
The Sev1
alert-ltc-verification-functions-exceptions-devalert was firing becausespan.record_exception(e)was called for all non-404 GitHub API errors — including expected client errors like 403 (private repo) and 401 (token briefly unresolved during env var rename). The alert threshold is ≥1 entry in the Application Insightsexceptionstable per 5 minutes, so a single user submitting a private repo URL was enough to fire it. A secondary issue:start_verification_jobhad no error handling around_get_session_maker(), so a settings load failure (which occurred between PR #501's Terraform rename and PR #502's code deploy) propagated as an unhandled exception to the Functions runtime.ci_status.pyanddevops_analysis.py— Gatespan.record_exceptionon genuine server errors only. HTTP 4xx responses are already covered byspan.add_eventinsidegithub_error_to_resultand do not belong in theexceptionstable.function_app.pystart_verification_job— Wrap the_get_session_maker()+ DB query block intry/exceptso settings or connection failures return a clean HTTP 500 JSON response instead of crashing the trigger and writing an unhandled exception to Application Insights.Checklist
api/alembic/versions/AND app code that depends on the new schema. If checked, explain why the split isn't being followed.