Description
I've been debugging this problem for a while; it's intermittent, which makes it harder to reproduce.
When running some jobs with scrapy-playwright, the jobs get abruptly terminated. If you observe the job's log, it doesn't even acknowledge the termination, as it would in a SIGTERM case; the process apparently gets killed.
As an example, for a simple spider (with scrapy-playwright) scraping webstaurant.com, here is how the log terminates (literally the last 3 lines); a minimal sketch of such a spider is included after the excerpt:
2024-09-02 23:15:59 [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {...}
2024-09-02 23:16:02 [scrapy.extensions.logstats] INFO: Crawled 163 pages (at 0 pages/min), scraped 173 items (at 0 items/min)
2024-09-02 23:16:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.webstaurantstore.com/10-strawberry-street-arctic-9c-arctic-blue-6-oz-porcelain-cup-pack/850ARCTIC9C.html>: HTTP status code is not handled or not allowed
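For context, a spider of the kind involved looks roughly like this. This is a simplified sketch, not the exact spider: the class name, start URL, and parsing logic are illustrative, and only the settings required by scrapy-playwright are shown.
```python
# Minimal sketch of a scrapy-playwright spider (illustrative only).
import scrapy


class WebstaurantSpider(scrapy.Spider):
    name = "webstaurant"
    custom_settings = {
        # Required by scrapy-playwright: route requests through its download
        # handler and use the asyncio-based Twisted reactor.
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www.webstaurantstore.com/",
            meta={"playwright": True},
        )

    def parse(self, response):
        # Follow links and yield items; the real parsing logic is more involved.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, meta={"playwright": True}, callback=self.parse)
```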
I first noticed the problem when running the jobs with scrapyd, and here is what scrapyd logs when the problem happens:
2024-07-31T18:07:33+0000 [-] Process died: exitstatus=None project='{my_project}' spider='webstaurant' job='ff9296c04f6211ef87de0242ac110004' pid=18659 args=['/usr/local/bin/python', '-m', 'scrapyd.runner', 'crawl', 'webstaurant', '-a', '_job=ff9296c04f6211ef87de0242ac110004', '-s', 'LOG_FILE=/var/lib/scrapyd/logs/{my_project}/webstaurant/ff9296c04f6211ef87de0242ac110004.log']
This is just extra data; the problem is unrelated to scrapyd, since it's reproducible without it.
In all occurrences, the error that seems to be the cause is a Node.js error:
2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Response: <200 https://assets.customer.io/assets/in-app.js>
2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Request: <GET https://code.gist.build/web/latest/gist.min.js> (resource type: script, referrer: https://www.webstaurantstore.com/)
2024-09-02T23:16:29+0000 [Launcher,16/stderr] node:events:496
throw er; // Unhandled 'error' event
^
Error: write EPIPE
at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
Emitted 'error' event on Socket instance at:
at emitErrorNT (node:internal/streams/destroy:169:8)
at emitErrorCloseNT (node:internal/streams/destroy:128:3)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
errno: -32,
code: 'EPIPE',
syscall: 'write'
}
Node.js v20.12.2
- This happens with other spiders as well; it's not related to a specific page/netloc.
- Sometimes it happens within the first 15 minutes, sometimes it takes more than 15 hours for the issue to appear. (For some reason it seems to happen more quickly when using containers.)
- I've reproduced this in Docker containers, on GCP Cloud Run, and directly on a VPS. I'm confident this isn't related to resources, as it happens with plenty of CPU/RAM available.
- Decreasing PLAYWRIGHT_MAX_PAGES_PER_CONTEXT and PLAYWRIGHT_MAX_CONTEXTS all the way to 1 had no effect (see the first sketch after this list).
- I've tested in 3 different situations, running the same input through a pure Playwright script (no Scrapy), and the problem didn't occur. All 3 times the job concluded well after 24 hours of running, which leads me to believe that the problem is somewhere in the Scrapy/Playwright "interaction" (a rough sketch of that standalone script follows the list as well).
- At this point I've used versions from 0.0.38 to 0.0.41; all had the issue.
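For clarity, the concurrency settings mentioned above were lowered like this (only the relevant settings.py lines, not the full configuration):
```python
# settings.py (relevant lines only) -- lowered all the way to 1 while testing
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1
PLAYWRIGHT_MAX_CONTEXTS = 1
```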
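The pure-Playwright comparison run was roughly of this shape. This is a sketch with my own naming (script name, function, URL list are illustrative), not the exact script; the point is that the same URLs are fetched through playwright directly, with no Scrapy/Twisted involved.
```python
# standalone_check.py -- rough sketch of the no-Scrapy comparison run
import asyncio

from playwright.async_api import async_playwright


async def main(urls):
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        for url in urls:
            response = await page.goto(url)
            print(response.status if response else "no response", url)
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main(["https://www.webstaurantstore.com/"]))
```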
Finally, I found two issues in playwright-python that bear some resemblance; the first one appears to log the same exception and is caused by the handling of the event loop.
microsoft/playwright-python#2275
microsoft/playwright-python#2454