Description
What is (possibly) going wrong?
Observing a mismatch between the headers and the body of the (scrapy) `Response`.
How to reproduce?
- Crawl a spider that yields a single (scrapy) `Request` configured to be handled by `scrapy-playwright`, as described in the README.
- In the `playwright_page_methods` meta, include a `PageMethod` as follows:
  - [a] Pass to the `method` argument either the string `"click"` or a `Callable` that `await`s the (playwright) `Locator.click` method.
  - [b] In the `PageMethod` or `Locator` (as chosen in [a]), pass to the `selector` argument the selector of a hyperlink to a different page.

[tl;dr] Make Playwright click a hyperlink to a page different from the one it originally landed on (after any redirects).
[note] Ensure that your code waits for the browser navigation to the linked page to complete (until at least `"commit"`).
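The steps above can be sketched as follows, assuming the `Callable` variant of `PageMethod` from [a]: `scrapy-playwright` awaits the callable with the Playwright page as its first argument. The selector `"a#next-page"` and the function name are placeholders of mine, not from the original report.

```python
# Hedged sketch of steps [a]/[b]: a coroutine passed to PageMethod that
# clicks a link and waits until the navigation reaches at least "commit",
# so the click's target page (not the interim page) is what gets captured.
async def click_and_wait(page, selector="a#next-page"):
    async with page.expect_navigation(wait_until="commit"):
        await page.locator(selector).click()

# Wired into the spider roughly like (sketch, not verbatim):
#
#   yield scrapy.Request(
#       url,
#       meta={
#           "playwright": True,
#           "playwright_page_methods": [PageMethod(click_and_wait)],
#       },
#   )
```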
What's happening in the background? (afaik)
In a simple example not involving any redirects, I believe the first (original) `Request` explicitly sent by the spider creates a second, implicit request to the URL referred to by the hyperlink when Playwright clicks the link it received in response to the first request.
In a more nuanced example, the first request gets redirected (HTTP 302) to a second request, and Playwright creates a third, implicit request when it clicks the link it received in response to the second request.
The first example of this I noticed in the wild involved four requests. Everything happened as described in the three-request example above, but the response to the third request was yet another HTTP 302 redirect, causing a fourth and final request to be sent.
Summary so far...
There are 4 requests in the chain, `Req. 1` ... `Req. 4`. There are 2 pages: the `<Interim_Page>`, which contains the link Playwright clicks, and the `<Requested_Page>`, which the spider parses.

`Req. 1` --[ redirect ]--> `Req. 2` --[ ok ]--> `<Interim_Page>` --[ click ]--> `Req. 3` --[ redirect ]--> `Req. 4` --[ ok ]--> `<Requested_Page>`
What is the result?
I am observing that the resulting (scrapy) `Response` contains the `body` of the response to `Req. 4`, but the `headers` of the response to `Req. 2`. This confuses me, and may also confuse some scrapy middleware, unless my understanding of the situation is incomplete.
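For illustration, the mismatch can be eyeballed from a parse callback with a small helper like the one below; `describe_response` is a hypothetical name of mine, and `response` is the scrapy `Response` the spider receives.

```python
# Hypothetical helper: in the scenario above, response.url and
# response.body come from the response to Req. 4, while
# response.headers still come from the response to Req. 2.
def describe_response(response):
    return {
        "url": response.url,
        "headers": dict(response.headers.items()),
        "body_preview": response.body[:80],
    }

# In a spider callback:
#   self.logger.info("%r", describe_response(response))
```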
What do I like about the result?
The final resulting `body` should continue to correspond to `Req. 4` by default, as it does right now, in my opinion. Reasoning:
- Generally, when the server has set up such a redirection pattern, the final data it would send in response to `Req. 4` would probably be highly similar to what it would have sent in response to `Req. 1`, had it skipped the redirection because it knew the client was returning to a page previously visited in the same session. This supports my opinion that it is usually beneficial for the response `body` to correspond, by default, to the last and final request in the chain, which was `Req. 4` in this case.
- It seems like `scrapy-playwright` (or perhaps the `RedirectMiddleware`, which may control this behavior) reacts to the HTTP 302 redirect responses in a reasonable and expected manner. The system is made to believe that the response `body` for `Req. 2` was the response `body` for `Req. 1`, which makes sense because the browser redirected in response to a 302. The system is also told that the response `body` for `Req. 4` was the response `body` for `Req. 3`, for the same reason, as it should be. At least in my scenario, the `PageMethod` click between `Req. 2` and `Req. 3` was intended similarly, aimed at making progress towards the `<Requested_Page>`, so it should continue to be treated like a redirect by default. That is, the click should forward the responsibility of setting the `body`, similar to how a redirect would.
- From my observations, the only response `body` that ends up reaching the spider's parser was the one in response to `Req. 4`, aka `<Requested_Page>`, which further supports my point that the `Response` object should continue to refer to that same `body`.
- This behavior works seamlessly with the core of scrapy, as discussed in [3.] above, and also with the `HttpCacheMiddleware`. To elaborate, the `response_url` in the cached `meta` file is that of the response to `Req. 4`, so the fact that the cached `response_body` file contains the response `body` of none other than `Req. 4` promotes cache data consistency.
- Finally, if a spider has instructions for executing a `PageMethod` to click a link to a different page even before the current page is parsed, then I am happy with the default assumption that the author intended the click as a redirect, unless they explicitly state otherwise.

There may be cases I haven't encountered where one might want the configurability to not treat such clicks as redirects. However, I would like to keep the scope of this issue limited to clearing up my confusion around, or fixing, the mismatch between the `body` and the `headers`. Overriding the default behavior around the `body` can be raised as a separate feature request if folks find the lack of it bothersome enough.
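The cache-consistency point can be checked with a stdlib-only sketch that reads one cache entry, assuming scrapy's default `FilesystemCacheStorage` on-disk layout (one directory per request fingerprint containing `pickled_meta`, `response_headers`, and `response_body` files); the function name is mine.

```python
import pickle
from pathlib import Path

def summarize_cache_entry(entry_dir):
    """Summarize one HTTP-cache entry so the cached response_url can be
    compared against the cached response_body / response_headers files.
    Assumes scrapy's default FilesystemCacheStorage layout."""
    entry = Path(entry_dir)
    meta = pickle.loads((entry / "pickled_meta").read_bytes())
    return {
        "response_url": meta.get("response_url"),
        "status": meta.get("status"),
        "body_bytes": (entry / "response_body").stat().st_size,
        "header_bytes": (entry / "response_headers").stat().st_size,
    }
```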
What should we change about the result?
The `headers` should match the `body`.
If every cog of this system wants to treat those 4 requests as a single chain, then we should embrace the chain. Why stop forwarding the responsibility of setting `headers` at `Req. 2` when we continue to forward the responsibility of setting the `body` through `Req. 4`?
When I inspected the value of the `Response` object and first noticed a `body` paired with `headers` from a different response, I wondered if it was a bug. I had no clue what was going on until my spider re-crawled many times and I read through the logs and cache, monitored the network tab, and formed the hypothesis in the "What's happening in the background? (afaik)" section above.
Let us get the responsibility of setting `headers` forwarded through the "click" of a hyperlink (to another route) facilitated by `scrapy-playwright`, so that the final `headers` associated with the (scrapy) `Response` are those of the final response in the chain, and thus match the `body` associated with that same response.
Why does this matter?
- The Response 2 `headers` being in the same (scrapy) `Response` object as the Response 4 `body` just sounds wrong. I know that `HttpCacheMiddleware` then also stores the data incorrectly, with the cached `response_headers` file matching neither the cached `response_body` file nor the `response_url` in the cached `meta` file. There may be layers of scrapy, or custom spider code one might write, that could malfunction from breaking the assumption that the `headers` match the `body` and the `url`.
- In particular, I saw that this messed up the `RFC2616` cache policy in the `HttpCacheMiddleware`. For example, say Response 2 contained cache `headers` indicating that caching should be avoided, since the `<Interim_Page>` was an on-the-fly response to the client's session state. However, say Response 4, the `<Requested_Page>`, was a relatively static resource, so it contained cache `headers` indicating that the resource should be cached for some time. With the Response 2 `headers` being incorrectly respected for the Response 4 resource `body`, the static resource was mistaken for a dynamic resource by scrapy's caching layer. The cache policy, which depends on accurate cache `headers` to reduce the spider's network request rate against the server, became less effective. This carries the danger of angering the server administrators by repeatedly requesting static resources that their server encourages clients to cache.
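For context, the RFC 2616 cache policy described above is enabled via scrapy settings roughly like the following `settings.py` fragment (a sketch using the standard scrapy setting names, not necessarily my exact configuration):

```python
# settings.py sketch: enable the HTTP cache with the RFC 2616 policy,
# which honors Cache-Control/Expires headers -- exactly the headers
# that end up mismatched in the scenario above.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```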
Version Details
scrapy-playwright==0.0.42
Scrapy==2.12.0
playwright==1.50.0
python==3.13.2