Description
What is (possibly) going wrong?
Observing a mismatch between the headers and the body of the (scrapy) `Response`.
How to reproduce?
- Crawl a spider that yields a single (scrapy) `Request` configured to be handled by `scrapy-playwright`, as described in the README.
- In the `playwright_page_methods` meta, include a `PageMethod` as follows:
  - [a] Pass to the `method` argument either the string `"click"` or a `Callable` that `await`s the (playwright) `Locator.click` method.
  - [b] In the `PageMethod` or `Locator` (as chosen in [a]), pass to the `selector` argument the selector of a hyperlink to a different page.

[tl;dr] Make Playwright click a hyperlink to a page different from the one it originally landed on (after any redirects).
[note] Ensure that your code waits for the browser navigation to the linked page to complete (until at least `"commit"`).
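The steps above can be sketched as follows, assuming the `Callable` variant of `PageMethod` from [a]: `scrapy-playwright` awaits the callable with the Playwright page as its first argument. The selector `"a#next-page"` and the function name are placeholders of mine, not from the original report.

```python
# Hedged sketch of steps [a]/[b]: a coroutine passed to PageMethod that
# clicks a link and waits until the navigation reaches at least "commit",
# so the click's target page (not the interim page) is what gets captured.
async def click_and_wait(page, selector="a#next-page"):
    async with page.expect_navigation(wait_until="commit"):
        await page.locator(selector).click()

# Wired into the spider roughly like (sketch, not verbatim):
#
#   yield scrapy.Request(
#       url,
#       meta={
#           "playwright": True,
#           "playwright_page_methods": [PageMethod(click_and_wait)],
#       },
#   )
```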
What's happening in the background? (afaik)
In a simple example not involving any redirects, I believe the first (original) `Request` explicitly sent by the spider creates a second, implicit request to the URL referred to by the hyperlink when Playwright clicks the link it received in response to the first request.
In a more nuanced example, the first request gets redirected (HTTP 302) to a second request, and Playwright creates a third, implicit request when it clicks the link it received in response to the second request.
The first example of this I noticed in the wild involved four requests. Everything happened as described in the three-request example above, but the response to the third request was yet another HTTP 302 redirect, causing a fourth and final request to be sent.
Summary so far...
There are 4 requests in the chain, `Req. 1` ... `Req. 4`. There are 2 pages: the `<Interim_Page>`, which contains the link Playwright clicks, and the `<Requested_Page>`, which the spider parses.

`Req. 1` --[ redirect ]--> `Req. 2` --[ ok ]--> `<Interim_Page>` --[ click ]--> `Req. 3` --[ redirect ]--> `Req. 4` --[ ok ]--> `<Requested_Page>`
What is the result?
I am observing that the resulting (scrapy) `Response` contains the `body` of the response to `Req. 4`, but the `headers` of the response to `Req. 2`. This confuses me, and may also confuse some scrapy middleware, unless my understanding of the situation is incomplete.
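For illustration, the mismatch can be eyeballed from a parse callback with a small helper like the one below; `describe_response` is a hypothetical name of mine, and `response` is the scrapy `Response` the spider receives.

```python
# Hypothetical helper: in the scenario above, response.url and
# response.body come from the response to Req. 4, while
# response.headers still come from the response to Req. 2.
def describe_response(response):
    return {
        "url": response.url,
        "headers": dict(response.headers.items()),
        "body_preview": response.body[:80],
    }

# In a spider callback:
#   self.logger.info("%r", describe_response(response))
```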
What do I like about the result?
The final resulting `body` should continue to correspond to `Req. 4` by default, as it does right now, in my opinion. Reasoning:
- Generally, when the server has set up such a redirection pattern, the final data it would send in response to `Req. 4` would probably be highly similar to what it would have sent in response to `Req. 1`, had it skipped the redirection because it knew the client was returning to a page previously visited in the same session. This supports my opinion that it is usually beneficial for the response `body` to correspond, by default, to the last and final request in the chain, which was `Req. 4` in this case.
- It seems like `scrapy-playwright` (or perhaps the `RedirectMiddleware`, which may control this behavior) reacts to the HTTP 302 redirect responses in a reasonable and expected manner. The system is made to believe that the response `body` for `Req. 2` was the response `body` for `Req. 1`, which makes sense because the browser redirected in response to a 302. The system is also told that the response `body` for `Req. 4` was the response `body` for `Req. 3`, for the same reason, as it should be. At least in my scenario, the `PageMethod` click between `Req. 2` and `Req. 3` was intended similarly, aimed at making progress towards the `<Requested_Page>`, so it should continue to be treated like a redirect by default. That is, the click should forward the responsibility of setting the `body`, similar to how a redirect would.
- From my observations, the only response `body` that ends up reaching the spider's parser was the one in response to `Req. 4`, aka `<Requested_Page>`, which further supports my point that the `Response` object should continue to refer to that same `body`.
- This behavior works seamlessly with the core of scrapy, as discussed in [3.] above, and also with the `HttpCacheMiddleware`. To elaborate, the `response_url` in the cached `meta` file is that of the response to `Req. 4`, so the fact that the cached `response_body` file contains the response `body` of none other than `Req. 4` promotes cache data consistency.
- Finally, if a spider has instructions for executing a `PageMethod` to click a link to a different page even before the current page is parsed, then I am happy with the default assumption that the author intended the click as a redirect, unless they explicitly state otherwise.

There may be cases I haven't encountered where one might want the configurability to not treat such clicks as redirects. However, I would like to keep the scope of this issue limited to clearing up my confusion around, or fixing, the mismatch between the `body` and the `headers`. Overriding the default behavior around the `body` can be raised as a separate feature request if folks find the lack of it bothersome enough.
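The cache-consistency point can be checked with a stdlib-only sketch that reads one cache entry, assuming scrapy's default `FilesystemCacheStorage` on-disk layout (one directory per request fingerprint containing `pickled_meta`, `response_headers`, and `response_body` files); the function name is mine.

```python
import pickle
from pathlib import Path

def summarize_cache_entry(entry_dir):
    """Summarize one HTTP-cache entry so the cached response_url can be
    compared against the cached response_body / response_headers files.
    Assumes scrapy's default FilesystemCacheStorage layout."""
    entry = Path(entry_dir)
    meta = pickle.loads((entry / "pickled_meta").read_bytes())
    return {
        "response_url": meta.get("response_url"),
        "status": meta.get("status"),
        "body_bytes": (entry / "response_body").stat().st_size,
        "header_bytes": (entry / "response_headers").stat().st_size,
    }
```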
What should we change about the result?
The `headers` should match the `body`.
If every cog of this system wants to treat those 4 requests as a single chain, then we should embrace the chain. Why stop forwarding the responsibility of setting `headers` at `Req. 2` when we continue to forward the responsibility of setting the `body` through `Req. 4`?
When I inspected the value of the `Response` object and first noticed a `body` paired with `headers` from a different response, I wondered if it was a bug. I had no clue what was going on until my spider re-crawled many times and I read through the logs and cache, monitored the network tab, and formed the hypothesis in the "What's happening in the background? (afaik)" section above.
Let us get the responsibility of setting `headers` forwarded through the "click" of a hyperlink (to another route) facilitated by `scrapy-playwright`, so that the final `headers` associated with the (scrapy) `Response` are those of the final response in the chain, and thus match the `body` associated with that same response.
Why does this matter?
- The Response 2 `headers` being in the same (scrapy) `Response` object as the Response 4 `body` just sounds wrong. I know that `HttpCacheMiddleware` then also stores the data incorrectly, with the cached `response_headers` file matching neither the cached `response_body` file nor the `response_url` in the cached `meta` file. There may be layers of scrapy, or custom spider code one might write, that could malfunction from breaking the assumption that the `headers` match the `body` and the `url`.
- In particular, I saw that this messed up the `RFC2616` cache policy in the `HttpCacheMiddleware`. For example, say Response 2 contained cache `headers` indicating that caching should be avoided, since the `<Interim_Page>` was an on-the-fly response to the client's session state. However, say Response 4, the `<Requested_Page>`, was a relatively static resource, so it contained cache `headers` indicating that the resource should be cached for some time. With the Response 2 `headers` being incorrectly respected for the Response 4 resource `body`, the static resource was mistaken for a dynamic resource by scrapy's caching layer. The cache policy, which depends on accurate cache `headers` to reduce the spider's network request rate against the server, became less effective. This carries the danger of angering the server administrators by repeatedly requesting static resources that their server encourages clients to cache.
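For context, the RFC 2616 cache policy described above is enabled via scrapy settings roughly like the following `settings.py` fragment (a sketch using the standard scrapy setting names, not necessarily my exact configuration):

```python
# settings.py sketch: enable the HTTP cache with the RFC 2616 policy,
# which honors Cache-Control/Expires headers -- exactly the headers
# that end up mismatched in the scenario above.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```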
Version Details
scrapy-playwright==0.0.42
Scrapy==2.12.0
playwright==1.50.0
python==3.13.2