fix: use cloned browser profile and pass user-agent to crawler contexts #755

Draft
younglim wants to merge 1 commit into master from fix/crawler-useragent-and-profile

@younglim

Collaborator

Summary

  • crawlDomain and crawlSitemap were creating empty sub-profile directories per browser launch, discarding the cloned Chrome profile data (cookies, preferences). The browser never benefited from the cloned profile.
  • Neither crawler passed process.env.OOBEE_USER_AGENT to the Playwright browser context, so WAF-protected sites (e.g. scdf.gov.sg) saw HeadlessChrome in the UA string and returned 403.
  • Extracted a shared getPreLaunchHook() into commonCrawlerFunc.ts to eliminate duplication between both crawlers.

Changes

| File | Change |
| --- | --- |
| src/crawlers/commonCrawlerFunc.ts | New getPreLaunchHook(userDataDirectory): removes stale SingletonLock, sets userDataDir to the actual cloned profile |
| src/crawlers/crawlDomain.ts | Use shared hook, pass OOBEE_USER_AGENT in launchOptions, remove sub-profile creation |
| src/crawlers/crawlSitemap.ts | Same as above |
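
For reviewers who want a picture of the shared hook without opening the diff, here is a minimal sketch of what getPreLaunchHook(userDataDirectory) is described as doing. It is not the code in this PR: the LaunchContextLike interface and the PreLaunchHook type are hypothetical stand-ins for the crawler library's own types, and the temp directory at the bottom only simulates a cloned profile.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical minimal shape of the launch context a pre-launch hook receives.
// The real library type is richer; only the field used here is modelled.
interface LaunchContextLike {
  userDataDir?: string;
}

type PreLaunchHook = (pageId: string, launchContext: LaunchContextLike) => void;

// Sketch of a shared hook factory: clear a stale lock, then point the browser
// at the cloned profile instead of a fresh, empty sub-directory.
export function getPreLaunchHook(userDataDirectory: string): PreLaunchHook {
  return (_pageId, launchContext) => {
    // force: true makes this a no-op when no lock file is present.
    fs.rmSync(path.join(userDataDirectory, "SingletonLock"), { force: true });
    launchContext.userDataDir = userDataDirectory;
  };
}

// Usage: simulate a cloned profile that still holds a lock from a crashed run.
const profileDir = fs.mkdtempSync(path.join(os.tmpdir(), "cloned-profile-"));
const lockPath = path.join(profileDir, "SingletonLock");
fs.writeFileSync(lockPath, "");

const ctx: LaunchContextLike = {};
getPreLaunchHook(profileDir)("page-1", ctx);
console.log(ctx.userDataDir === profileDir, fs.existsSync(lockPath));
```

Returning a closure keeps both crawlers on the same logic while letting each pass its own profile path.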

Root cause

| Crawler | Before | After |
| --- | --- | --- |
| crawlDomain.ts | Created empty profile-{ts}-{rand}/ sub-dirs; UA never set | Uses cloned profile directly; userAgent passed in preLaunchHooks |
| crawlSitemap.ts | Same | Same |
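
The "After" behaviour for the user agent can be sketched as follows. This is an illustration under assumptions, not the PR's code: applyUserAgent and UaLaunchContext are hypothetical names, and the only detail taken from the description is that process.env.OOBEE_USER_AGENT ends up in launchOptions.userAgent.

```typescript
// Hypothetical minimal shape of the launch context; only launchOptions is modelled.
interface UaLaunchContext {
  launchOptions?: Record<string, unknown>;
}

// Hypothetical helper: merge the configured user agent into launchOptions,
// leaving every other option untouched. Does nothing when no UA is configured.
function applyUserAgent(
  launchContext: UaLaunchContext,
  userAgent: string | undefined = process.env.OOBEE_USER_AGENT,
): void {
  if (!userAgent) return;
  launchContext.launchOptions = { ...launchContext.launchOptions, userAgent };
}

// Usage: existing options survive, and userAgent is added alongside them.
const uaCtx: UaLaunchContext = { launchOptions: { headless: true } };
applyUserAgent(uaCtx, "Mozilla/5.0 (illustrative UA string)");
console.log(uaCtx.launchOptions);
```

Skipping the assignment when the variable is unset leaves the browser default in place, so environments that do not configure OOBEE_USER_AGENT behave as before.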

Test plan

  • npx tsc --noEmit passes
  • Scan a normal site — verify no regression
  • Verify navigator.userAgent in crawled pages does not contain HeadlessChrome
  • Verify no SingletonLock errors on repeated scan runs
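
The user-agent item in the test plan could be checked with a small predicate like the one below. It is a sketch: looksHeadless is a hypothetical helper, and the two UA strings are illustrative samples rather than values captured from a scan. In a real run the string would come from evaluating navigator.userAgent inside a crawled page.

```typescript
// Marker that WAFs are described as matching on in the default headless UA.
const HEADLESS_MARKER = "HeadlessChrome";

// Hypothetical check: true when the UA string still advertises headless Chrome.
function looksHeadless(userAgent: string): boolean {
  return userAgent.includes(HEADLESS_MARKER);
}

// Illustrative sample strings (assumed, not taken from an actual scan).
const defaultHeadlessUa =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36";
const overriddenUa =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

console.log(looksHeadless(defaultHeadlessUa), looksHeadless(overriddenUa));
```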

…er contexts

The domain and sitemap crawlers were creating empty sub-profile directories
per browser launch, discarding the cloned Chrome profile data. They also
never passed process.env.OOBEE_USER_AGENT to the browser context, causing
WAF-protected sites to see HeadlessChrome in the UA and return 403.

- Extract shared getPreLaunchHook() into commonCrawlerFunc that removes
  stale SingletonLock and sets userDataDir to the actual cloned profile
- Pass OOBEE_USER_AGENT in preLaunchHooks launchOptions for both crawlers
- Remove per-browser sub-profile creation (empty dirs served no purpose)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>