Improve report generation resilience #171

Open
constansino wants to merge 1 commit into SXP-Simon:main from constansino:fix/report-generation-resilience

@constansino constansino commented Apr 16, 2026

Summary

This PR improves the failure handling around group daily report generation.

In production, I hit a case where report generation looked "randomly broken" across different groups, but there were actually three separate failure modes in the same path:

  1. One stuck LLM subtask could block asyncio.gather() indefinitely, so the whole report never completed.
  2. The default public T2I endpoint (https://t2i.soulter.top/text2img) was returning a Cloudflare 502 Bad gateway HTML page instead of image bytes.
  3. Image generation and image sending did not have bounded timeouts, so the request path could hang for a long time before eventually falling back.

This PR adds bounded timeouts, official T2I endpoint fallback, and clearer logging so these failures degrade predictably instead of stalling the entire command.

Root cause observed

While debugging a live AstrBot deployment on 2026-04-16, the plugin was able to produce reports for some groups but not others.

The main rendering issue was not a template bug. The primary upstream T2I service returned HTML error pages like:

  • <title>soulter.top | 502: Bad gateway</title>

The plugin previously treated the response as failed/invalid image data and moved on without enough context, and it only targeted a single endpoint by default. At the same time, if one concurrent LLM analyzer call stalled, the whole report pipeline waited forever because asyncio.gather(..., return_exceptions=True) still waits for every task to finish.

What changed

1. Add per-analysis task timeouts

src/infrastructure/analysis/llm_analyzer.py

  • Add a default per-task timeout (240s) for each concurrent analysis subtask.
  • Wrap topic / user title / golden quote / chat quality analysis calls with asyncio.wait_for(...).
  • Log timed out subtasks as warnings and continue assembling the report from the remaining successful subtasks.
  • Apply the same timeout behavior to incremental analysis mode.

This keeps one slow or stuck upstream LLM request from blocking the entire report forever.
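The per-task timeout described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `run_with_timeout`, `analyze_all`, and the two stand-in coroutines are hypothetical names, and the real analyzer methods take more arguments.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def run_with_timeout(name, coro, timeout_seconds):
    """Await one subtask; log a warning and re-raise if it times out."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        logger.warning("analysis subtask %s timed out after %ss", name, timeout_seconds)
        raise


async def analyze_all(subtasks, timeout_seconds):
    """Run all subtasks concurrently and keep only the ones that succeeded."""
    names = list(subtasks)
    results = await asyncio.gather(
        *(run_with_timeout(n, subtasks[n], timeout_seconds) for n in names),
        return_exceptions=True,  # failures come back as values instead of raising
    )
    return {n: r for n, r in zip(names, results) if not isinstance(r, BaseException)}


async def _fast():
    return ["topic A"]


async def _stuck():
    await asyncio.sleep(3600)  # stands in for a stalled LLM request


def demo():
    # A tiny timeout is used here only so the demo finishes quickly.
    return asyncio.run(analyze_all({"topic": _fast(), "golden_quote": _stuck()}, 0.05))
```

Because every subtask is bounded individually, `asyncio.gather` still waits for all of them, but no single one can take longer than the timeout, so the report is assembled from whatever finished.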

2. Add official T2I endpoint discovery and fallback

src/infrastructure/reporting/generators.py

  • Fetch and cache official T2I endpoints from https://api.soulter.top/astrbot/t2i-endpoints.
  • Normalize and deduplicate discovered endpoints.
  • When the current primary endpoint is the default public endpoint, prefer trying official backup endpoints first.
  • Cache HtmlRenderer instances per endpoint.
  • Try every render strategy against every available endpoint instead of betting everything on one public node.

This makes rendering resilient when the default public renderer is temporarily unhealthy.
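One way to picture the normalize, deduplicate, and ordering steps listed above is the sketch below. The function names and the exact normalization rules are assumptions made for illustration; only the default endpoint URL comes from this PR description.

```python
DEFAULT_PUBLIC_ENDPOINT = "https://t2i.soulter.top/text2img"


def normalize_endpoint(url):
    """Trim whitespace and trailing slashes; return '' for unusable values."""
    if not isinstance(url, str):
        return ""
    url = url.strip().rstrip("/")
    return url if url.startswith(("http://", "https://")) else ""


def order_endpoints(primary, official):
    """Return a deduplicated list of endpoints in the order they should be tried."""
    primary = normalize_endpoint(primary)
    backups = [u for u in map(normalize_endpoint, official) if u and u != primary]
    if primary == DEFAULT_PUBLIC_ENDPOINT:
        candidates = backups + [primary]  # default public node is tried last
    else:
        candidates = [primary] + backups  # a custom endpoint keeps its priority
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(u for u in candidates if u))
```

With this ordering, a self-hosted primary endpoint is always tried first, while the default public endpoint is only used after the discovered backups.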

3. Detect non-image HTML responses explicitly

src/infrastructure/reporting/generators.py

  • Inspect returned bytes / temp files for HTML content.
  • Extract a concise summary (for example the HTML <title>) when the renderer returns an error page instead of an image.
  • Log those responses explicitly so operators can distinguish "renderer returned HTML 502" from generic invalid-image failures.

This should make future upstream failures much easier to diagnose from logs.
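As a rough sketch of the HTML detection described above, assuming the response is already available as bytes: the helper names, the 4096-byte scan window, and the 120-character limit are illustrative choices, not values taken from the PR.

```python
import re

# Leading bytes of common image formats (PNG, JPEG, GIF).
IMAGE_MAGIC = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a")
_TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_image(data):
    """Check the magic number of the payload (WebP is a RIFF container)."""
    return data.startswith(IMAGE_MAGIC) or (data[:4] == b"RIFF" and data[8:12] == b"WEBP")


def summarize_non_image(data, limit=120):
    """Return a short summary if data looks like HTML, else None."""
    head = data[:4096].lstrip().lower()
    if not (head.startswith((b"<!doctype html", b"<html")) or b"<title" in head):
        return None
    match = _TITLE_RE.search(data[:4096])
    raw = match.group(1) if match else data[:limit]
    text = raw.decode("utf-8", errors="replace")
    return " ".join(text.split())[:limit]  # collapse whitespace, then truncate
```

A caller would typically check `looks_like_image` first and only ask for a summary when that check fails, so the log line can say "renderer returned HTML: soulter.top | 502: Bad gateway" rather than a generic invalid-image message.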

4. Bound image generation and send operations

main.py

  • Add an overall timeout for image report generation (240s).
  • Add an overall timeout for image sending (60s).
  • Add clearer lifecycle logs around generation and send success/failure.

This ensures the command path falls back to text output in a predictable amount of time instead of hanging indefinitely.
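The bounded generate-then-send path with a text fallback might look roughly like this. `deliver_report` and its three callables are hypothetical stand-ins; the real plugin's `generate_image_report` and adapter calls have different signatures, and only the 240s and 60s budgets come from the description above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

GENERATION_TIMEOUT_SECONDS = 240
SEND_TIMEOUT_SECONDS = 60


async def deliver_report(generate_image, send_image, send_text,
                         gen_timeout=GENERATION_TIMEOUT_SECONDS,
                         send_timeout=SEND_TIMEOUT_SECONDS):
    """Try the image path within fixed time budgets; fall back to text otherwise."""
    try:
        image_url = await asyncio.wait_for(generate_image(), timeout=gen_timeout)
        if image_url:
            await asyncio.wait_for(send_image(image_url), timeout=send_timeout)
            return "image"
        logger.warning("image generation returned nothing; falling back to text")
    except asyncio.TimeoutError:
        logger.warning("image path timed out; falling back to text")
    except Exception:
        logger.exception("image path failed; falling back to text")
    await send_text()
    return "text"
```

The worst case for the image path is then roughly the sum of the two budgets, after which the text report is sent.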

Behavior after this PR

With these changes:

  • A single stuck LLM subtask no longer blocks the whole report.
  • A dead/unhealthy default T2I endpoint no longer takes down image rendering by itself.
  • HTML error pages from the renderer are visible in logs as upstream failures instead of vague invalid-image results.
  • The plugin still preserves the existing text fallback behavior when image generation or image sending ultimately fails.

Verification

Manual verification:

  • python -m py_compile main.py src/infrastructure/analysis/llm_analyzer.py src/infrastructure/reporting/generators.py
  • Deployed the same patch set to a live AstrBot instance and confirmed the plugin resumed generating reports successfully after the renderer fallback logic was added.

If you'd like, I can also split this into multiple smaller PRs, but I kept it together because these failures were all observed in the same end-to-end report generation path.

Summary by Sourcery

Improve resilience and observability of group daily report generation by adding timeouts and more robust T2I endpoint handling so failures degrade gracefully instead of hanging.

New Features:

  • Support configurable per-analysis LLM task timeouts to prevent a single stalled subtask from blocking the whole report.
  • Discover and prioritize official T2I endpoints, with per-endpoint HtmlRenderer caching and multi-endpoint rendering attempts for image reports.
  • Explicitly detect and summarize non-image HTML responses from T2I services to surface upstream errors in logs.
  • Add global time limits for image report generation and image sending to ensure timely fallback to text output.

Enhancements:

  • Enhance logging around LLM analysis timeouts, T2I endpoint behavior, and image report lifecycle to aid production diagnosis.

Add per-analysis task timeouts so a single stuck LLM subtask does not block the whole report forever.

Discover and cache official T2I endpoints, prefer backup endpoints when the default public endpoint is unhealthy, and surface non-image HTML error pages in logs for faster diagnosis.

Also bound image generation and image sending with explicit timeouts so failed renders fall back to text output predictably instead of hanging the request path.

sourcery-ai Bot commented Apr 16, 2026

Contributor

Reviewer's Guide

Improves robustness of group daily report generation by adding per-task timeouts to LLM analyses, introducing T2I endpoint discovery and fallback with better validation and logging of non-image responses, and enforcing bounded timeouts around image report generation and sending.

Sequence diagram for resilient image report generation with timeouts and T2I fallback

sequenceDiagram
  actor User
  participant GroupDailyAnalysis
  participant LLMAnalyzer
  participant ReportGenerator
  participant OfficialEndpointsAPI
  participant T2IEndpoint
  participant Adapter

  User->>GroupDailyAnalysis: request_group_daily_report
  GroupDailyAnalysis->>LLMAnalyzer: analyze_all_concurrent(messages,...)
  LLMAnalyzer-->>GroupDailyAnalysis: analysis_result

  alt output_format_image
    GroupDailyAnalysis->>GroupDailyAnalysis: asyncio.wait_for(generate_image_report, 240s)
    GroupDailyAnalysis->>ReportGenerator: generate_image_report(analysis_result, group_id, html_render,...)

    ReportGenerator->>ReportGenerator: _get_t2i_render_endpoints()
    alt endpoints_cache_valid
      ReportGenerator-->>ReportGenerator: use_cached_endpoints
    else cache_miss
      ReportGenerator->>OfficialEndpointsAPI: GET /astrbot/t2i-endpoints
      OfficialEndpointsAPI-->>ReportGenerator: active endpoint list
      ReportGenerator->>ReportGenerator: normalize_deduplicate_and_cache_endpoints
    end

    loop image_options_strategies
      loop render_endpoints
        GroupDailyAnalysis->>ReportGenerator: try_strategy_with_endpoint
        ReportGenerator->>T2IEndpoint: render_custom_template(html, options)
        alt within_75s_timeout
          T2IEndpoint-->>ReportGenerator: image_or_error_data
          alt valid_image_bytes_or_file
            ReportGenerator-->>GroupDailyAnalysis: image_url, html_content
            GroupDailyAnalysis-->>Adapter: asyncio.wait_for(send_image, 60s)
            Adapter-->>GroupDailyAnalysis: sent_success
            GroupDailyAnalysis-->>User: image_report_sent
          else non_image_html_or_invalid_data
            ReportGenerator->>ReportGenerator: _extract_non_image_response_summary
            ReportGenerator->>ReportGenerator: log_upstream_html_failure
          end
        else strategy_or_endpoint_timeout
          ReportGenerator->>ReportGenerator: log_timeout_and_try_next
        end
      end
    end

    alt all_strategies_failed
      ReportGenerator-->>GroupDailyAnalysis: image_url None
      GroupDailyAnalysis-->>User: fallback_to_text_report
    end

  else output_format_text
    GroupDailyAnalysis-->>User: text_report
  end
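The endpoint cache branch in the diagram above (use cached endpoints while valid, otherwise fetch and cache) could be sketched as a small TTL cache guarded by a lock. `EndpointCache`, the injected `fetch` callable, and the 600-second TTL are assumptions for illustration; the PR does not state its actual TTL here.

```python
import asyncio
import time


class EndpointCache:
    """Cache a fetched endpoint list for ttl_seconds, refreshing under a lock."""

    def __init__(self, fetch, ttl_seconds=600.0):
        self._fetch = fetch  # async callable returning a list of URLs
        self._ttl = ttl_seconds
        self._lock = asyncio.Lock()
        self._value = ()
        self._expire_at = 0.0

    async def get(self):
        if time.monotonic() < self._expire_at:
            return list(self._value)
        async with self._lock:
            # Re-check: another coroutine may have refreshed while we waited.
            if time.monotonic() < self._expire_at:
                return list(self._value)
            try:
                self._value = tuple(await self._fetch())
            except Exception:
                # Discovery is best-effort: keep whatever was cached before.
                pass
            self._expire_at = time.monotonic() + self._ttl
            return list(self._value)
```

In this sketch a failed fetch is also remembered until the TTL expires, which avoids hammering the discovery API while it is down; a shorter retry interval after failures would be a reasonable alternative.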

Sequence diagram for concurrent LLM analysis with per-task timeouts

sequenceDiagram
  participant GroupDailyAnalysis
  participant LLMAnalyzer
  participant ConfigManager
  participant TopicAnalyzer
  participant UserTitleAnalyzer
  participant GoldenQuoteAnalyzer
  participant ChatQualityAnalyzer

  GroupDailyAnalysis->>LLMAnalyzer: analyze_all_concurrent(messages,...)
  LLMAnalyzer->>ConfigManager: _get_group(llm)
  ConfigManager-->>LLMAnalyzer: config_llm_group
  LLMAnalyzer->>LLMAnalyzer: _get_analysis_task_timeout_seconds()
  LLMAnalyzer-->>LLMAnalyzer: task_timeout_seconds

  par topic_task
    LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(topic,..., timeout)
    LLMAnalyzer->>TopicAnalyzer: analyze_topics(...)
    alt completes_before_timeout
      TopicAnalyzer-->>LLMAnalyzer: topic_result
      LLMAnalyzer-->>LLMAnalyzer: topic_result_ok
    else timeout
      LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_topic
      LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
    end
  and user_title_task
    LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(user_title,..., timeout)
    LLMAnalyzer->>UserTitleAnalyzer: analyze_user_titles(...)
    alt completes_before_timeout
      UserTitleAnalyzer-->>LLMAnalyzer: user_title_result
    else timeout
      LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_user_title
      LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
    end
  and golden_quote_task
    LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(golden_quote,..., timeout)
    LLMAnalyzer->>GoldenQuoteAnalyzer: analyze_golden_quotes(...)
    alt completes_before_timeout
      GoldenQuoteAnalyzer-->>LLMAnalyzer: golden_quote_result
    else timeout
      LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_golden_quote
      LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
    end
  and chat_quality_task
    LLMAnalyzer->>LLMAnalyzer: _run_analysis_task_with_timeout(chat_quality,..., timeout)
    LLMAnalyzer->>ChatQualityAnalyzer: analyze_quality(...)
    alt completes_before_timeout
      ChatQualityAnalyzer-->>LLMAnalyzer: chat_quality_result
    else timeout
      LLMAnalyzer-->>LLMAnalyzer: log_warning_timeout_chat_quality
      LLMAnalyzer-->>LLMAnalyzer: raise TimeoutError
    end
  end

  LLMAnalyzer->>LLMAnalyzer: asyncio.gather(return_exceptions=True)
  LLMAnalyzer->>LLMAnalyzer: inspect_results_and_log
  LLMAnalyzer-->>GroupDailyAnalysis: partial_aggregated_analysis_result

Updated class diagram for LLMAnalyzer, ReportGenerator, and GroupDailyAnalysis

classDiagram
  class LLMAnalyzer {
    +TopicAnalyzer topic_analyzer
    +UserTitleAnalyzer user_title_analyzer
    +GoldenQuoteAnalyzer golden_quote_analyzer
    +ChatQualityAnalyzer chat_quality_analyzer
    +int DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS
    +LLMAnalyzer(context, config_manager)
    +int _get_analysis_task_timeout_seconds()
    +_run_analysis_task_with_timeout(task_name, coro, timeout_seconds)
    +analyze_all_concurrent(messages, umo, user_activity, top_users, session_id, options)
    +analyze_incremental_concurrent(messages, umo, user_activity, top_users, session_id, options)
  }

  class TopicAnalyzer {
    +analyze_topics(messages, umo, session_id)
  }

  class UserTitleAnalyzer {
    +analyze_user_titles(messages, user_activity, umo, top_users, session_id)
  }

  class GoldenQuoteAnalyzer {
    +analyze_golden_quotes(messages, umo, session_id)
  }

  class ChatQualityAnalyzer {
    +analyze_quality(messages, umo, session_id)
  }

  class ReportGenerator {
    +tuple~str~ _t2i_endpoints_cache
    +float _t2i_endpoints_cache_expire_at
    +asyncio.Lock _t2i_endpoint_lock
    +dict~str, object~ _t2i_renderer_cache
    +ReportGenerator(config_manager, data_dir)
    +list~str~ _get_t2i_render_endpoints()
    +_get_t2i_renderer(endpoint)
    +_render_html_via_t2i_endpoint(endpoint, html_content, image_options)
    +str _normalize_t2i_endpoint(endpoint)
    +str _extract_non_image_response_summary(image_data)
    +generate_image_report(analysis_result, group_id, html_render, avatar_url_getter, nickname_getter)
  }

  class GroupDailyAnalysis {
    +report_generator ReportGenerator
    +handle_group_daily_report_request(group_id, output_format, adapter)
  }

  LLMAnalyzer --> TopicAnalyzer
  LLMAnalyzer --> UserTitleAnalyzer
  LLMAnalyzer --> GoldenQuoteAnalyzer
  LLMAnalyzer --> ChatQualityAnalyzer

  GroupDailyAnalysis --> LLMAnalyzer
  GroupDailyAnalysis --> ReportGenerator

  class Adapter {
    +send_image(group_id, image_url, caption)
  }

  GroupDailyAnalysis --> Adapter

File-Level Changes

Change: Add configurable per-analysis-task timeouts and integrate them into concurrent and incremental LLM analysis flows so a single slow subtask cannot block the entire report.
Details:
  • Introduce DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS and a config-driven getter for per-task timeout seconds.
  • Wrap each topic, user_title, golden_quote, and chat_quality coroutine in a timeout helper using asyncio.wait_for.
  • Propagate timeout handling into both analyze_all_concurrent and analyze_incremental_concurrent, logging timeouts as warnings while skipping failed subtasks instead of failing the whole analysis.
Files: src/infrastructure/analysis/llm_analyzer.py

Change: Implement resilient T2I endpoint discovery, caching, and per-endpoint HtmlRenderer reuse, and try all strategies across all endpoints with per-strategy timeouts.
Details:
  • Normalize and cache T2I endpoints, including a default public endpoint and dynamically fetched official endpoints with TTL-based caching and a lock for concurrent access.
  • Build and reuse HtmlRenderer instances keyed by normalized endpoint, and add a helper to render HTML via a specific endpoint.
  • Adjust generate_image_report to first resolve candidate endpoints, then iterate over render strategies and endpoints with a bounded per-call timeout, logging which endpoint/strategy is being attempted and tracking the last failure.
Files: src/infrastructure/reporting/generators.py

Change: Explicitly detect and summarize non-image HTML responses from T2I to improve observability of upstream failures.
Details:
  • Add a helper that inspects returned bytes or temp files, detects probable HTML, and extracts a compact <title> or truncated text summary.
  • Augment image validation to log when magic numbers do not match known image formats and use the HTML summary to construct clearer warning messages and synthetic exceptions for non-image responses.
  • Refine logging in generate_image_report to include endpoint, strategy, timeout, and reason when responses are invalid, HTML, or time out.
Files: src/infrastructure/reporting/generators.py

Change: Bound image report generation and sending durations in the main plugin flow and improve lifecycle logging for these operations.
Details:
  • Introduce IMAGE_REPORT_GENERATION_TIMEOUT_SECONDS and IMAGE_REPORT_SEND_TIMEOUT_SECONDS constants.
  • Wrap generate_image_report and adapter.send_image calls in asyncio.wait_for with appropriate exception handling and logging for timeouts and general failures.
  • Add info-level logs before generation, before send, and on success to make the end-to-end image path easier to trace, while falling back to text output when image creation or send fails.
Files: main.py

Possibly linked issues

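  • #[Bug] Unable to generate the image properly; falls back to sending the summary as text: The issue reports that T2I image report generation fails and falls back to text; this PR addresses it with endpoint discovery, fallback, and timeouts.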

@sourcery-ai sourcery-ai Bot left a comment

Contributor

Hey - I've left some high level feedback:

  • In LLMAnalyzer._get_analysis_task_timeout_seconds you're reaching into the config manager’s private _get_group API; consider introducing a public helper or wrapper for this lookup so the analyzer isn’t coupled to an internal implementation detail.
  • The nested loops in generate_image_report over render_strategies and render_endpoints with repeated logging and timeout handling are getting fairly complex; consider extracting the per-endpoint/per-strategy attempt (including validation and HTML error summarization) into a small helper to keep generate_image_report more readable.
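The second suggestion, extracting a single endpoint/strategy attempt into a helper, might be shaped roughly like this. `try_render`, `render_with_fallback`, and the injected `render` callable are hypothetical names; the 75-second default mirrors the per-attempt timeout shown in the reviewer's guide diagram.

```python
import asyncio


async def try_render(render, endpoint, strategy, timeout_seconds):
    """Run one endpoint/strategy attempt; return (image_bytes, failure_reason)."""
    try:
        data = await asyncio.wait_for(render(endpoint, strategy), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return None, f"timeout after {timeout_seconds}s"
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    if not data:
        return None, "empty response"
    return data, None


async def render_with_fallback(render, endpoints, strategies, timeout_seconds=75):
    """Try every strategy against every endpoint; return the first success."""
    last_failure = None
    for strategy in strategies:
        for endpoint in endpoints:
            data, last_failure = await try_render(render, endpoint, strategy, timeout_seconds)
            if data is not None:
                return data, None
    return None, last_failure
```

Keeping validation and failure classification inside the helper leaves the outer function as two plain loops, and the last failure reason is still available for logging when every attempt fails.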

@SXP-Simon

Owner

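Thanks for the contribution. Some parts don't quite match what I expect from the plugin. Could you describe the specific problems you ran into? It looks like several issues were reported together and merged into a single PR.

LLM timeouts should be controlled on the AstrBot LLM provider side.

Having only one T2I endpoint by default is also not quite normal: two T2I endpoints are built in officially. If T2I runs into problems, users can follow the docs to deploy their own, or temporarily use a T2I service provided by contributors. I haven't decided whether the plugin itself needs to support multiple T2I endpoints; I'd lean toward having AstrBot core control that part.

That said, information like <title>soulter.top | 502: Bad gateway</title> is a very useful reference. Thanks for the feedback.

I'll take a closer look at the code later.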

SXP-Simon added a commit that referenced this pull request Apr 23, 2026
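1. Support configuring a two-round T2I rendering strategy, letting users customize image format (PNG/JPEG), quality, resolution, and timeout.
2. Automatically detect HTML error pages returned by T2I and extract a summary (e.g. 502 Bad Gateway), making troubleshooting more efficient.

- See #174 for the suggestions on dynamic T2I timeouts and configuration
- See #171 for the suggestion on handling HTML error responses from T2I
- Resolves [Feature Request] Support configuring PNG/JPEG rendering priority, to avoid falling back to sending as a file due to local large-image protection for large reports #164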
@Liangyu-G

Contributor

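I reviewed this PR during a patrol. Its direction is closely related to what #157 / #174 ask for: it routes several kinds of failure ("a single LLM subtask hangs", "the default T2I endpoint misbehaves", "unbounded waits for image generation/sending") onto degradable paths, which is very helpful in production.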

Two suggestions to consider before merging:

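  1. Consider adding the timeout setting to _conf_schema.json
    LLMAnalyzer._get_analysis_task_timeout_seconds() already tries to read llm.analysis_task_timeout_seconds, but the current PR does not seem to update the schema to match. As a result, advanced users cannot adjust it in the panel and in practice only get the default 240s. Consider adding an int option (for example default 240, with a hint explaining that it is the timeout for a single topic/user title/golden quote/quality subtask) so that it matches the "configurable" wording in the PR description.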

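  2. The external request for official T2I endpoint discovery should keep "observable but not fatal" semantics
    Currently a failure logs a warning and falls back to the primary endpoint, which is good. Consider also noting in the README/configuration docs that if a user has set up their own stable T2I base URL, the official endpoints serve only as candidates/fallbacks and should not change the priority of the self-hosted one. The code currently puts the primary first when it is not the default public endpoint, which looks reasonable to me.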

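Overall, this PR can serve as part of the "basic degradation capability" for #157, but the "continue sending / send a specified error message / send nothing after failure" options and the failure emoji from #157 still need a separately configured policy to be complete.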

@Liangyu-G

Contributor

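I looked at this during a scheduled collaborative patrol. This PR is clearly related to the complex-page rendering timeouts mentioned earlier in #174 and to the empty/stuck reports caused by upstream instability mentioned in #157, and its direction is very helpful.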

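One small suggestion: the code already supports reading the per-subtask analysis timeout from llm.analysis_task_timeout_seconds, but the current PR does not update _conf_schema.json accordingly. If users are meant to tune it, consider adding a config option; if you'd rather not expose it for now, the PR description could state explicitly that it is currently a hidden/advanced setting with a 240s default.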
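For reference, a defensive read of that setting could look like the sketch below. `read_task_timeout` and the plain-dict config are assumptions for illustration; the plugin's real getter goes through its config manager, and only the key name and the 240s default come from this thread.

```python
DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS = 240


def read_task_timeout(llm_config, default=DEFAULT_ANALYSIS_TASK_TIMEOUT_SECONDS):
    """Return a positive integer timeout from the llm config group, else the default."""
    raw = (llm_config or {}).get("analysis_task_timeout_seconds", default)
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    # Zero or negative values would disable the protection, so ignore them.
    return value if value > 0 else default
```

Falling back to the default on missing, malformed, or non-positive values means the timeout stays in effect even if the option is never added to the schema.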

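Also, the T2I endpoint fallback plus the non-image HTML response summary is very practical for diagnosing Cloudflare 502s. Before merging, it would be enough to double-check that it stays consistent with the wording of the project's existing two-round t2i_rendering strategy.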

@Liangyu-G Liangyu-G left a comment

Contributor

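Maintainer patrol: this PR focuses on report generation resilience and solves real production problems such as hung LLM subtasks, the T2I upstream returning HTML on failure, and unbounded waits for image generation/sending. The change is well scoped overall, keeps the text fallback, and carries manageable risk. I have checked the main changed files and the status; there is currently no required CI. Approved for merge.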

@Liangyu-G

Contributor

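Maintainer patrol follow-up: in this round I re-reviewed the PR and tried to merge it. The conclusion is unchanged: well scoped, manageable risk, approved. However, the GitHub API returned Not Found when executing the merge, so it could not be completed. I suggest the repository maintainer check in the web UI whether branch permissions, protection rules, or merge restrictions for external forks apply, or cherry-pick the change onto a branch in the main repository and merge it from there.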

@Liangyu-G

Contributor

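Maintainer patrol follow-up: I still recommend merging this PR, but the GitHub API currently returns Not Found for merge/update on this repository, so I cannot perform the merge directly this round. Would the repository Owner please merge it in the web UI, or confirm that the maintainer token has write access to this repository. Before merging, it is worth also confirming that later template changes on main do not conflict with the T2I fallback in generators.py. This PR carries no config migration risk, and the failure path still falls back to text.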

3 participants