
Handling a lost connection with the Playwright process that makes the scrape hang on error #331

@milan-cp-dev

Description


I run long scrapes that require long sequences of actions during each Playwright request. I have handled most of the problems with the scrape and established long runs, but I am now facing issues with lost connections to the Playwright process. We obviously can't do much about a process that has died, but please help me ensure that such a request ends up in the errback so we can handle it properly and continue scraping.

The minimal spider setup is described in minimal_spider_setyp.txt; a rough sketch of that kind of setup is shown below.
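
For context only, this is a minimal sketch of the kind of spider involved, not the exact contents of the attachment; the spider name, URL, and callback bodies are placeholders.

```python
import scrapy


class MinimalPlaywrightSpider(scrapy.Spider):
    # Hypothetical name and start URL, stand-ins for the real spider.
    name = "minimal_playwright"

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={
                "playwright": True,
                # The page object is needed for the long sequences of actions.
                "playwright_include_page": True,
            },
            callback=self.parse,
            errback=self.errback,  # a dead Playwright process should end up here
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # ... long sequence of Playwright actions ...
        await page.close()
        yield {"url": response.url}

    async def errback(self, failure):
        # Close the page if we got one, then let the crawl continue.
        page = failure.request.meta.get("playwright_page")
        if page is not None:
            await page.close()
        self.logger.error("Request failed: %r", failure)
```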

The root cause of the error is:
/opt/scrapy_enviroment/lib/python3.11/site-packages/playwright/driver/playwright.sh: line 6: 2323044 Hangup "$PLAYWRIGHT_NODEJS_PATH" "$SCRIPT_PATH/package/lib/cli/cli.js" "$@"
/opt/scrapy_enviroment/lib/python3.11/site-packages/playwright/driver/playwright.sh: line 6: 2323042 Hangup "$PLAYWRIGHT_NODEJS_PATH" "$SCRIPT_PATH/package/lib/cli/cli.js" "$@"

To surface this condition we used ScrapyPlaywrightMemoryUsageExtension, and we caught the error as shown in inital_error.txt.

We have extended ScrapyPlaywrightMemoryUsageExtension so that such exceptions can be caught with try/except. We have attempted to raise a known Scrapy/scrapy-playwright error so that it is routed back to the errback function, which should handle the remainder and let the scrape proceed.

Can you please evaluate our CustomScrapyPlaywrightMemoryUsageExtension, advise whether IgnoreRequest is a suitable exception here, and suggest what we can do moving forward? We are still debugging the current solution as I report this; a rough outline of the approach follows, and the full version is attached as custom_memusage_extension.txt.
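
The sketch below is an illustration of the approach rather than the exact contents of custom_memusage_extension.txt. It assumes the extension inherits the get_virtual_size() hook from Scrapy's MemoryUsage extension, and the module path used in the settings snippet is hypothetical.

```python
import logging

from scrapy.exceptions import IgnoreRequest
from scrapy_playwright.memusage import ScrapyPlaywrightMemoryUsageExtension

logger = logging.getLogger(__name__)


class CustomScrapyPlaywrightMemoryUsageExtension(ScrapyPlaywrightMemoryUsageExtension):
    """Wrap the periodic memory check so a dead Playwright node process is
    reported loudly instead of failing silently inside the extension."""

    def get_virtual_size(self):
        try:
            return super().get_virtual_size()
        except Exception as exc:
            # The node process backing Playwright appears to be gone. We would
            # like the in-flight requests to fail into their errback instead of
            # hanging; whether raising IgnoreRequest here is the right way to
            # achieve that is exactly what this issue asks.
            logger.error("Lost connection with the Playwright process: %r", exc)
            raise IgnoreRequest("Playwright process is gone") from exc
```

The custom class is enabled through the standard EXTENSIONS setting, along these lines (the "myproject.extensions" path is a placeholder):

```python
EXTENSIONS = {
    "myproject.extensions.CustomScrapyPlaywrightMemoryUsageExtension": 0,
}
```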

minimal_spider_setyp.txt
inital_error.txt
custom_memusage_extension.txt
