Description
I've been debugging this problem for a while; it's intermittent, which makes it harder to reproduce.
When running some jobs with scrapy-playwright, the jobs get abruptly terminated. If you observe the job's log, it doesn't even acknowledge the termination, as it would in a SIGTERM case; the process apparently gets killed.
As an example, for a simple spider (with scrapy-playwright) scraping webstaurant.com, here is how the log terminates (literally the last 3 lines); a minimal sketch of such a spider is included after the excerpt:
2024-09-02 23:15:59 [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {...}
2024-09-02 23:16:02 [scrapy.extensions.logstats] INFO: Crawled 163 pages (at 0 pages/min), scraped 173 items (at 0 items/min)
2024-09-02 23:16:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.webstaurantstore.com/10-strawberry-street-arctic-9c-arctic-blue-6-oz-porcelain-cup-pack/850ARCTIC9C.html>: HTTP status code is not handled or not allowed
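For context, a spider of the kind involved looks roughly like this. This is a simplified sketch, not the exact spider: the class name, start URL, and parsing logic are illustrative, and only the settings required by scrapy-playwright are shown.
```python
# Minimal sketch of a scrapy-playwright spider (illustrative only).
import scrapy


class WebstaurantSpider(scrapy.Spider):
    name = "webstaurant"
    custom_settings = {
        # Required by scrapy-playwright: route requests through its download
        # handler and use the asyncio-based Twisted reactor.
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www.webstaurantstore.com/",
            meta={"playwright": True},
        )

    def parse(self, response):
        # Follow links and yield items; the real parsing logic is more involved.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, meta={"playwright": True}, callback=self.parse)
```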
I first noticed the problem when running the jobs with scrapyd, and here is what scrapyd logs when the problem happens:
2024-07-31T18:07:33+0000 [-] Process died: exitstatus=None project='{my_project}' spider='webstaurant' job='ff9296c04f6211ef87de0242ac110004' pid=18659 args=['/usr/local/bin/python', '-m', 'scrapyd.runner', 'crawl', 'webstaurant', '-a', '_job=ff9296c04f6211ef87de0242ac110004', '-s', 'LOG_FILE=/var/lib/scrapyd/logs/{my_project}/webstaurant/ff9296c04f6211ef87de0242ac110004.log']
This is just extra data; the problem is unrelated to scrapyd, since it's reproducible without it.
In all occurrences, the error that seems to be the cause is a Node.js error:
2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Response: <200 https://assets.customer.io/assets/in-app.js>
2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Request: <GET https://code.gist.build/web/latest/gist.min.js> (resource type: script, referrer: https://www.webstaurantstore.com/)
2024-09-02T23:16:29+0000 [Launcher,16/stderr] node:events:496
throw er; // Unhandled 'error' event
^
Error: write EPIPE
at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
Emitted 'error' event on Socket instance at:
at emitErrorNT (node:internal/streams/destroy:169:8)
at emitErrorCloseNT (node:internal/streams/destroy:128:3)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
errno: -32,
code: 'EPIPE',
syscall: 'write'
}
Node.js v20.12.2
- This happens with other spiders as well; it's not related to a specific page/netloc.
- Sometimes it happens within the first 15 minutes, sometimes it takes more than 15 hours for the issue to appear. (For some reason it seems to happen more quickly when using containers.)
- I've reproduced this in Docker containers, on GCP Cloud Run, and directly on a VPS. I'm confident this isn't related to resources, as it happens with plenty of CPU/RAM available.
- Decreasing PLAYWRIGHT_MAX_PAGES_PER_CONTEXT and PLAYWRIGHT_MAX_CONTEXTS all the way to 1 had no effect (see the first sketch after this list).
- I've tested in 3 different situations, running the same input through a pure Playwright script (no Scrapy), and the problem didn't occur. All 3 times the job concluded well after 24 hours of running, which leads me to believe that the problem is somewhere in the Scrapy/Playwright "interaction" (a rough sketch of that standalone script follows the list as well).
- At this point I've used versions from 0.0.38 to 0.0.41; all had the issue.
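For clarity, the concurrency settings mentioned above were lowered like this (only the relevant settings.py lines, not the full configuration):
```python
# settings.py (relevant lines only) -- lowered all the way to 1 while testing
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1
PLAYWRIGHT_MAX_CONTEXTS = 1
```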
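The pure-Playwright comparison run was roughly of this shape. This is a sketch with my own naming (script name, function, URL list are illustrative), not the exact script; the point is that the same URLs are fetched through playwright directly, with no Scrapy/Twisted involved.
```python
# standalone_check.py -- rough sketch of the no-Scrapy comparison run
import asyncio

from playwright.async_api import async_playwright


async def main(urls):
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        for url in urls:
            response = await page.goto(url)
            print(response.status if response else "no response", url)
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main(["https://www.webstaurantstore.com/"]))
```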
Finally, I found two issues in playwright-python that bear some resemblance; the first one appears to log the same exception and is caused by the handling of the event loop.
microsoft/playwright-python#2275
microsoft/playwright-python#2454