Improve report generation resilience #171
Add per-analysis task timeouts so a single stuck LLM subtask does not block the whole report forever. Discover and cache official T2I endpoints, prefer backup endpoints when the default public endpoint is unhealthy, and surface non-image HTML error pages in logs for faster diagnosis. Also bound image generation and image sending with explicit timeouts so failed renders fall back to text output predictably instead of hanging the request path.
Reviewer's Guide
Improves robustness of group daily report generation by adding per-task timeouts to LLM analyses, introducing T2I endpoint discovery and fallback with better validation and logging of non-image responses, and enforcing bounded timeouts around image report generation and sending.
Sequence diagram for resilient image report generation with timeouts and T2I fallback
sequenceDiagram
actor User
participant GroupDailyAnalysis
participant LLMAnalyzer
participant ReportGenerator
participant OfficialEndpointsAPI
participant T2IEndpoint
participant Adapter
User->>GroupDailyAnalysis: request_group_daily_report
GroupDailyAnalysis->>LLMAnalyzer: analyze_all_concurrent(messages,...)
LLMAnalyzer-->>GroupDailyAnalysis: analysis_result
alt output_format_image
GroupDailyAnalysis->>GroupDailyAnalysis: asyncio.wait_for(generate_image_report, 240s)
GroupDailyAnalysis->>ReportGenerator: generate_image_report(analysis_result, group_id, html_render,...)
ReportGenerator->>ReportGenerator: _get_t2i_render_endpoints()
alt endpoints_cache_valid
ReportGenerator-->>ReportGenerator: use_cached_endpoints
else cache_miss
ReportGenerator->>OfficialEndpointsAPI: GET /astrbot/t2i-endpoints
OfficialEndpointsAPI-->>ReportGenerator: active endpoint list
ReportGenerator->>ReportGenerator: normalize_deduplicate_and_cache_endpoints
end
loop image_options_strategies
loop render_endpoints
GroupDailyAnalysis->>ReportGenerator: try_strategy_with_endpoint
ReportGenerator->>T2IEndpoint: render_custom_template(html, options)
alt within_75s_timeout
T2IEndpoint-->>ReportGenerator: image_or_error_data
alt valid_image_bytes_or_file
ReportGenerator-->>GroupDailyAnalysis: image_url, html_content
GroupDailyAnalysis-->>Adapter: asyncio.wait_for(send_image, 60s)
Adapter-->>GroupDailyAnalysis: sent_success
GroupDailyAnalysis-->>User: image_report_sent
else non_image_html_or_invalid_data
ReportGenerator->>ReportGenerator: _extract_non_image_response_summary
ReportGenerator->>ReportGenerator: log_upstream_html_failure
end
else strategy_or_endpoint_timeout
ReportGenerator->>ReportGenerator: log_timeout_and_try_next
end
end
end
alt all_strategies_failed
ReportGenerator-->>GroupDailyAnalysis: image_url None
GroupDailyAnalysis-->>User: fallback_to_text_report
end
else output_format_text
GroupDailyAnalysis-->>User: text_report
end
Sequence diagram for concurrent LLM analysis with per-task timeouts
sequenceDiagram
participant GroupDailyAnalysis
participant LLMAnalyzer
participant ConfigManager
participant TopicAnalyzer
participant UserTitleAnalyzer
participant GoldenQuoteAnalyzer
participant ChatQualityAnalyzer
GroupDailyAnalysis->>LLMAnalyzer: analyze_all_concurrent(messages,...)
LLMAnalyzer->>ConfigManager: _get_group(llm)
ConfigManager-->>LLMAnalyzer: config_llm_group
LLMAnalyzer->>LLMAnalyzer: _get_analysis_task_timeout_seconds()
LLMAnalyzer-->>LLMAnalyzer: task_timeout_seconds
par topic_task
LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(topic,..., timeout)
LLMAnalyzer->>TopicAnalyzer: analyze_topics(...)
alt completes_before_timeout
TopicAnalyzer-->>LLMAnalyzer: topic_result
LLMAnalyzer-->>LLMAnalyzer: topic_result_ok
else timeout
LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_topic
LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
end
and user_title_task
LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(user_title,..., timeout)
LLMAnalyzer->>UserTitleAnalyzer: analyze_user_titles(...)
alt completes_before_timeout
UserTitleAnalyzer-->>LLMAnalyzer: user_title_result
else timeout
LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_user_title
LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
end
and golden_quote_task
LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(golden_quote,..., timeout)
LLMAnalyzer->>GoldenQuoteAnalyzer: analyze_golden_quotes(...)
alt completes_before_timeout
GoldenQuoteAnalyzer-->>LLMAnalyzer: golden_quote_result
else timeout
LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_golden_quote
LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
end
and chat_quality_task
LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(chat_quality,..., timeout)
LLMAnalyzer->>ChatQualityAnalyzer: analyze_quality(...)
alt completes_before_timeout
ChatQualityAnalyzer-->>LLMAnalyzer: chat_quality_result
else timeout
LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_chat_quality
LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
end
end
LLMAnalyzer->>LLMAnalyzer: asyncio.gather(return_exceptions=True)
LLMAnalyzer->>LLMAnalyzer: inspect_results_and_log
LLMAnalyzer-->>GroupDailyAnalysis: partial_aggregated_analysis_result
Updated class diagram for LLMAnalyzer, ReportGenerator, and GroupDailyAnalysis
classDiagram
class LLMAnalyzer {
+TopicAnalyzer topic_analyzer
+UserTitleAnalyzer user_title_analyzer
+GoldenQuoteAnalyzer golden_quote_analyzer
+ChatQualityAnalyzer chat_quality_analyzer
+int DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS
+LLMAnalyzer(context, config_manager)
+int _get_analysis_task_timeout_seconds()
+_run_analysis_task_with_timeout(task_name, coro, timeout_seconds)
+analyze_all_concurrent(messages, umo, user_activity, top_users, session_id, options)
+analyze_incremental_concurrent(messages, umo, user_activity, top_users, session_id, options)
}
class TopicAnalyzer {
+analyze_topics(messages, umo, session_id)
}
class UserTitleAnalyzer {
+analyze_user_titles(messages, user_activity, umo, top_users, session_id)
}
class GoldenQuoteAnalyzer {
+analyze_golden_quotes(messages, umo, session_id)
}
class ChatQualityAnalyzer {
+analyze_quality(messages, umo, session_id)
}
class ReportGenerator {
+tuple~str~ _t2i_endpoints_cache
+float _t2i_endpoints_cache_expire_at
+asyncio.Lock _t2i_endpoint_lock
+dict~str, object~ _t2i_renderer_cache
+ReportGenerator(config_manager, data_dir)
+list~str~ _get_t2i_render_endpoints()
+_get_t2i_renderer(endpoint)
+_render_html_via_t2i_endpoint(endpoint, html_content, image_options)
+str _normalize_t2i_endpoint(endpoint)
+str _extract_non_image_response_summary(image_data)
+generate_image_report(analysis_result, group_id, html_render, avatar_url_getter, nickname_getter)
}
class GroupDailyAnalysis {
+report_generator ReportGenerator
+handle_group_daily_report_request(group_id, output_format, adapter)
}
LLMAnalyzer --> TopicAnalyzer
LLMAnalyzer --> UserTitleAnalyzer
LLMAnalyzer --> GoldenQuoteAnalyzer
LLMAnalyzer --> ChatQualityAnalyzer
GroupDailyAnalysis --> LLMAnalyzer
GroupDailyAnalysis --> ReportGenerator
class Adapter {
+send_image(group_id, image_url, caption)
}
GroupDailyAnalysis --> Adapter
|
Hey - I've left some high level feedback:
- In `LLMAnalyzer._get_analysis_task_timeout_seconds` you're reaching into the config manager's private `_get_group` API; consider introducing a public helper or wrapper for this lookup so the analyzer isn't coupled to an internal implementation detail.
- The nested loops in `generate_image_report` over `render_strategies` and `render_endpoints` with repeated logging and timeout handling are getting fairly complex; consider extracting the per-endpoint/per-strategy attempt (including validation and HTML error summarization) into a small helper to keep `generate_image_report` more readable.
|
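Thanks for the contribution. Some parts don't quite match what I expect from the plugin. Could you describe the specific problems that occurred? It looks like several issues were reported together and then merged into a single PR. LLM timeouts should be controlled on the AstrBot LLM provider side. Having only one T2I endpoint by default is also unusual: two T2I endpoints are built in officially, and if T2I runs into problems you can deploy your own by following the docs, or temporarily use a T2I service provided by contributors. I haven't decided whether the plugin itself needs to support multiple T2I endpoints; I lean toward letting AstrBot itself control this. That said, I'll take a closer look at the code later.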
|
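I reviewed this PR. Its direction is closely related to the requests in #157 / #174: it routes three classes of failure ("a single LLM subtask hangs", "the default T2I endpoint misbehaves", "image generation/sending waits without bound") onto paths that can degrade gracefully, which is very helpful in production. Two suggestions for consideration before merging:
Overall, this PR can serve as part of the "baseline degradation capability" for #157, but the "continue sending / send a specified error message / do not send after a failure" options and the failure emoji from #157 still need a separately configurable policy to be complete.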
|
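I took a look during a scheduled collaboration check. This PR is clearly related to the complex-page render timeouts mentioned earlier in #174 and to the empty or stuck reports caused by an unstable upstream mentioned in #157, so the direction is very helpful. One small suggestion: the code already supports, from … Also, the T2I endpoint fallback plus the non-image HTML response summary is very practical for diagnosing Cloudflare 502s; before merging, I suggest double-checking it against the project's existing …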
Liangyu-G left a comment
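Maintainer check: this PR focuses on report generation resilience and addresses real production problems: stuck LLM subtasks, the T2I upstream returning HTML instead of an image, and unbounded waits on image generation and sending. The changes are well scoped, the text fallback is preserved, and the risk is manageable. I checked the main changed files and the PR status; there is currently no required CI. Approved for merge.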
|
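Maintainer check follow-up: in this round I re-reviewed the PR and attempted to merge it. The conclusion is unchanged: the scope is clear, the risk is manageable, and it is approved. However, when performing the merge, the GitHub API returned …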
|
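Maintainer check follow-up: this PR is still recommended for merge, but the GitHub API currently returns Not Found for merge/update on this repository, so the merge cannot be performed directly in this round. Repository Owner, please merge it in the web UI, or confirm that the maintainer token has write access to this repository. Before merging, it is also worth confirming that later template changes on main will not conflict with …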
Summary
This PR improves the failure handling around group daily report generation.
In production, I hit a case where report generation looked "randomly broken" across different groups, but there were actually three separate failure modes in the same path:
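1. A concurrent LLM analysis call could stall inside `asyncio.gather()` indefinitely, so the whole report never completed.
2. The default T2I endpoint (`https://t2i.soulter.top/text2img`) was returning a Cloudflare `502 Bad gateway` HTML page instead of image bytes.
3. Image generation and image sending had no upper bound on waiting time, so a failed render could hang the request path.
This PR adds bounded timeouts, official T2I endpoint fallback, and clearer logging so these failures degrade predictably instead of stalling the entire command.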
Root cause observed
While debugging a live AstrBot deployment on 2026-04-16, the plugin was able to produce reports for some groups but not others.
The main rendering issue was not a template bug. The primary upstream T2I service returned HTML error pages like:
`<title>soulter.top | 502: Bad gateway</title>`
The plugin previously treated the response as failed/invalid image data and moved on without enough context, and it only targeted a single endpoint by default. At the same time, if one concurrent LLM analyzer call stalled, the whole report pipeline waited forever because `asyncio.gather(..., return_exceptions=True)` still waits for every task to finish.
What changed
1. Add per-analysis task timeouts
`src/infrastructure/analysis/llm_analyzer.py`
- Add a default timeout (`240s`) for each concurrent analysis subtask.
- Wrap each subtask in `asyncio.wait_for(...)`.
This keeps one slow or stuck upstream LLM request from blocking the entire report forever.
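The per-task timeout described above can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: the free functions `run_analysis_task_with_timeout` and `analyze_all_concurrent` only borrow names from the class diagram, and mapping failed tasks to `None` is an assumption about how partial results might be aggregated.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# Assumed default, mirroring the 240s figure quoted in this PR.
DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS = 240


async def run_analysis_task_with_timeout(task_name, coro, timeout_seconds):
    """Await one analysis coroutine; wait_for cancels it once the timeout elapses."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        logger.warning("analysis task %s timed out after %ss", task_name, timeout_seconds)
        raise


async def analyze_all_concurrent(tasks, timeout_seconds=DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS):
    """Run every named coroutine concurrently and return name -> result.

    Tasks that time out or raise are mapped to None (a hypothetical
    convention for this sketch), so one stuck task cannot block the rest.
    """
    names = list(tasks)
    results = await asyncio.gather(
        *(run_analysis_task_with_timeout(name, tasks[name], timeout_seconds) for name in names),
        return_exceptions=True,
    )
    return {
        name: (None if isinstance(result, BaseException) else result)
        for name, result in zip(names, results)
    }
```

Because every awaitable handed to `asyncio.gather` is already bounded by `asyncio.wait_for`, the gather call itself finishes within roughly one timeout period even when a subtask never returns.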
2. Add official T2I endpoint discovery and fallback
`src/infrastructure/reporting/generators.py`
- Discover official T2I endpoints from `https://api.soulter.top/astrbot/t2i-endpoints`.
- Cache `HtmlRenderer` instances per endpoint.
This makes rendering resilient when the default public renderer is temporarily unhealthy.
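One way to picture the discovery and caching step is the sketch below. It is illustrative only: `EndpointCache`, `normalize_endpoint`, `dedupe_endpoints`, the 600-second TTL, and the "default first, discovered endpoints after" ordering are all assumptions, and the network call to the discovery API is replaced by an injected `fetch` callable. The sketch is synchronous, whereas the class diagram indicates the real implementation guards its cache with an `asyncio.Lock`.

```python
import time
from typing import Callable, Iterable, List, Tuple

# Default public endpoint named in this PR.
DEFAULT_T2I_ENDPOINT = "https://t2i.soulter.top/text2img"


def normalize_endpoint(endpoint: str) -> str:
    """Trim whitespace and trailing slashes (a hypothetical normalization rule)."""
    return endpoint.strip().rstrip("/")


def dedupe_endpoints(endpoints: Iterable[str]) -> List[str]:
    """Normalize endpoints and drop duplicates while preserving order."""
    seen = set()
    ordered: List[str] = []
    for raw in endpoints:
        endpoint = normalize_endpoint(raw)
        if endpoint and endpoint not in seen:
            seen.add(endpoint)
            ordered.append(endpoint)
    return ordered


class EndpointCache:
    """Cache a discovered endpoint list until a time-to-live expires."""

    def __init__(
        self,
        fetch: Callable[[], Iterable[str]],
        ttl_seconds: float = 600.0,  # assumed TTL, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Tuple[str, ...] = ()
        self._expire_at = 0.0

    def get(self) -> List[str]:
        now = self._clock()
        if self._cached and now < self._expire_at:
            return list(self._cached)
        try:
            discovered = list(self._fetch())
        except Exception:
            # Discovery failing should not prevent rendering via the default.
            discovered = []
        endpoints = dedupe_endpoints([DEFAULT_T2I_ENDPOINT, *discovered])
        self._cached = tuple(endpoints)
        self._expire_at = now + self._ttl
        return endpoints
```

A caller would then try each endpoint in the returned order and move on to the next one when a render fails or times out.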
3. Detect non-image HTML responses explicitly
`src/infrastructure/reporting/generators.py`
- Extract and log a short summary of the response (such as its `<title>`) when the renderer returns an error page instead of an image.
This should make future upstream failures much easier to diagnose from logs.
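The non-image detection could look something like the following sketch. The function names, the magic-byte list, and the 120-character limit are assumptions for illustration; the PR's own helper is `_extract_non_image_response_summary`, whose exact behaviour is not shown here.

```python
import re
from typing import Optional

# Leading bytes of common image formats (PNG, JPEG, GIF).
_IMAGE_MAGIC = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a")
_TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_image(data: bytes) -> bool:
    """Return True when the payload starts with a known image signature."""
    return data.startswith(_IMAGE_MAGIC)


def summarize_non_image_response(data: bytes, max_len: int = 120) -> Optional[str]:
    """Return a short log-friendly summary of a non-image payload, or None for images."""
    if looks_like_image(data):
        return None
    match = _TITLE_RE.search(data)
    if match:
        text = match.group(1).decode("utf-8", errors="replace")
    else:
        # No <title>: fall back to the first bytes of the body.
        text = data[:max_len].decode("utf-8", errors="replace")
    # Collapse whitespace so the summary fits on one log line.
    return " ".join(text.split())[:max_len]
```

For the Cloudflare page quoted in this PR, the summary would be the title text `soulter.top | 502: Bad gateway`, which is enough to tell an upstream outage apart from a template bug.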
4. Bound image generation and send operations
`main.py`
- Bound image report generation with a timeout (`240s`).
- Bound image sending with a timeout (`60s`).
This ensures the command path falls back to text output in a predictable amount of time instead of hanging indefinitely.
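A rough sketch of the bounded generate-then-send path with a text fallback is shown below. The function `send_report` and its callable parameters are hypothetical stand-ins for the plugin's handler, report generator, and adapter; only the `240s` and `60s` figures come from this PR.

```python
import asyncio

# Timeout values quoted in this PR.
IMAGE_REPORT_TIMEOUT_SECONDS = 240
IMAGE_SEND_TIMEOUT_SECONDS = 60


async def send_report(
    generate_image,
    send_image,
    send_text,
    gen_timeout=IMAGE_REPORT_TIMEOUT_SECONDS,
    send_timeout=IMAGE_SEND_TIMEOUT_SECONDS,
):
    """Try the image path within fixed time bounds, else fall back to text.

    generate_image() returns an image URL or None; send_image(url) and
    send_text() deliver the report. Returns which path was used.
    """
    try:
        image_url = await asyncio.wait_for(generate_image(), timeout=gen_timeout)
        if image_url is not None:
            await asyncio.wait_for(send_image(image_url), timeout=send_timeout)
            return "image"
    except asyncio.TimeoutError:
        # Either rendering or sending exceeded its bound: degrade to text.
        pass
    await send_text()
    return "text"
```

With these bounds the worst case for the image path is about `gen_timeout + send_timeout` seconds before the text report goes out.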
Behavior after this PR
With these changes:
Verification
Manual verification:
`python -m py_compile main.py src/infrastructure/analysis/llm_analyzer.py src/infrastructure/reporting/generators.py`
If you'd like, I can also split this into multiple smaller PRs, but I kept it together because these failures were all observed in the same end-to-end report generation path.
Summary by Sourcery
Improve resilience and observability of group daily report generation by adding timeouts and more robust T2I endpoint handling so failures degrade gracefully instead of hanging.
New Features:
Enhancements: